
HA Linkerd doesn't self-heal when one node goes down on k3s cluster #7232

Closed
tobiasmuehl opened this issue Nov 7, 2021 · 5 comments

@tobiasmuehl

Bug Report

Running Linkerd in HA configuration, I expect that if one node's hardware dies, any pods scheduled on that node will be rescheduled onto another node automatically. I also expect the viz dashboard to keep working during this node failure, since 2 out of 3 nodes are still available.

What is the issue?

Pods are not rescheduled onto new hardware, and the dashboard can stop working.

How can it be reproduced?

Create a 3-node cluster with k3s. Deploy Linkerd in HA mode and viz in HA mode. Add another node to the k3s cluster. Kill the hardware of one of the first three nodes, but don't drain or remove the node with kubectl.

Environment

  • Kubernetes Version: 1.21
  • Cluster Environment: k3s
  • Host OS: Ubuntu 20
  • Linkerd version: Client version: stable-2.11.1
    Server version: stable-2.11.1

Possible solution

I'm wondering whether the anti-affinity rules somehow lock the pods onto the dead node and prevent rescheduling elsewhere.
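
For reference, HA-style pod anti-affinity is roughly of the following shape (an illustrative sketch, not the exact Linkerd chart output). The IgnoredDuringExecution suffix means the rule is only evaluated at scheduling time, so it can't pin an already-running pod to a dead node, although a required rule can block a replacement replica if every remaining node already hosts one:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          linkerd.io/control-plane-component: destination  # label shown for illustration
      topologyKey: kubernetes.io/hostname  # at most one replica per node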

Additional context

https://linkerd.slack.com/archives/C89RTCWJF/p1635968159355300

@mateiidavid
Member

Hi @tobiasmuehl, apologies for taking a bit of time with my reply; it's been a hectic week for me. We talked about this on Slack; iirc your pods were never rescheduled when killing the node (although the node did receive a NoSchedule taint).

For the purpose of this issue, I did not look into linkerd-viz (or its dashboard) as originally described on Slack. Instead, I sought to reproduce your issue as written here (on GH); I wanted to see how Kubernetes reacts in my own environment. I want to note, though, that any issues with the scheduler not working (or misbehaving) are out of our hands; what I'm concerned with here is testing (and validating) our HA rules (anti-affinity, mainly).

I used k3d to bootstrap a cluster with 3 nodes; the nodes run as "master" nodes. I installed Linkerd stable-2.11.1 in HA mode. Even though I'm using the most recent stable version, afaik we haven't done much to change the HA configuration lately, so I'd assume this test would hold for older versions too. Here's my state at the start of the test:

:; k get nodes
NAME                       STATUS   ROLES                       AGE     VERSION
k3d-multiserver-server-0   Ready    control-plane,etcd,master   7m48s   v1.21.2+k3s1
k3d-multiserver-server-1   Ready    control-plane,etcd,master   7m36s   v1.21.2+k3s1
k3d-multiserver-server-2   Ready    control-plane,etcd,master   7m21s   v1.21.2+k3s1

:; linkerd version
Client version: stable-2.11.1
Server version: stable-2.11.1

:; kgp -n linkerd -o wide
NAME                                      READY   STATUS            RESTARTS   AGE   IP          NODE                       NOMINATED NODE   READINESS GATES
linkerd-destination-7dd8b9bd89-28wwz      0/4     PodInitializing   0          11s   10.42.2.4   k3d-multiserver-server-2   <none>           <none>
linkerd-destination-7dd8b9bd89-4zndw      0/4     PodInitializing   0          11s   10.42.0.8   k3d-multiserver-server-0   <none>           <none>
linkerd-destination-7dd8b9bd89-bl95t      0/4     PodInitializing   0          11s   10.42.1.8   k3d-multiserver-server-1   <none>           <none>
linkerd-identity-558c65db5f-77hgf         0/2     PodInitializing   0          11s   10.42.2.3   k3d-multiserver-server-2   <none>           <none>
linkerd-identity-558c65db5f-cl6x5         0/2     PodInitializing   0          11s   10.42.1.6   k3d-multiserver-server-1   <none>           <none>
linkerd-identity-558c65db5f-ws66b         1/2     Running           0          11s   10.42.0.6   k3d-multiserver-server-0   <none>           <none>
linkerd-proxy-injector-8684c69f89-9wfzc   0/2     PodInitializing   0          11s   10.42.1.7   k3d-multiserver-server-1   <none>           <none>
linkerd-proxy-injector-8684c69f89-dnr4x   0/2     Running           0          11s   10.42.0.7   k3d-multiserver-server-0   <none>           <none>
linkerd-proxy-injector-8684c69f89-tjmbc   0/2     PodInitializing   0          11s   10.42.2.5   k3d-multiserver-server-2   <none>           <none>

The next step was to delete one of the existing nodes and register a new node, both using k3d's CLI (I did not delete the node using kubectl):

:; k3d node delete k3d-multiserver-server-0
INFO[0000] Deleted k3d-multiserver-server-0

:; k3d node create newserver --cluster multiserver --role server
INFO[0000] Starting Node 'k3d-newserver-0'

After a while (~ 2 minutes), Kubernetes registered the new state:

# At first
:; k get nodes
NAME                       STATUS     ROLES                       AGE     VERSION
k3d-multiserver-server-0   Ready      control-plane,etcd,master   9m18s   v1.21.2+k3s1
k3d-multiserver-server-1   Ready      control-plane,etcd,master   9m6s    v1.21.2+k3s1
k3d-multiserver-server-2   Ready      control-plane,etcd,master   8m51s   v1.21.2+k3s1
k3d-newserver-0            NotReady   control-plane,etcd,master   4s      v1.21.2+k3s1

# After a while
:; k get nodes
NAME                       STATUS     ROLES                       AGE     VERSION
k3d-multiserver-server-0   NotReady   control-plane,etcd,master   11m     v1.21.2+k3s1
k3d-multiserver-server-1   Ready      control-plane,etcd,master   11m     v1.21.2+k3s1
k3d-multiserver-server-2   Ready      control-plane,etcd,master   11m     v1.21.2+k3s1
k3d-newserver-0            Ready      control-plane,etcd,master   2m13s   v1.21.2+k3s1

However, my pods had not been moved yet. I did a bit of searching while I let k8s manage its state; I was curious what the default behaviour should be. I came across this issue in the k8s GitHub repo. One of the answers (granted, from 2017):

By default pods won't be moved for 5 minutes, which is configurable via the following flag on the controller manager: --pod-eviction-timeout duration [ref].

By this point, my new node had already been registered for ~7 minutes:

8m39s       Normal    Starting                  node/k3d-newserver-0            Starting kubelet.
8m39s       Warning   InvalidDiskCapacity       node/k3d-newserver-0            invalid capacity 0 on image filesystem
8m39s       Normal    NodeHasSufficientMemory   node/k3d-newserver-0            Node k3d-newserver-0 status is now: NodeHasSufficientMemory
8m39s       Normal    NodeHasNoDiskPressure     node/k3d-newserver-0            Node k3d-newserver-0 status is now: NodeHasNoDiskPressure
8m39s       Normal    NodeHasSufficientPID      node/k3d-newserver-0            Node k3d-newserver-0 status is now: NodeHasSufficientPID
8m39s       Normal    NodeAllocatableEnforced   node/k3d-newserver-0            Updated Node Allocatable limit across pods
8m39s       Normal    Synced                    node/k3d-newserver-0            Node synced successfully
8m36s       Normal    Starting                  node/k3d-newserver-0            Starting kube-proxy.
8m35s       Normal    RegisteredNode            node/k3d-newserver-0            Node k3d-newserver-0 event: Registered Node k3d-newserver-0 in Controller
8m29s       Normal    NodeReady                 node/k3d-newserver-0            Node k3d-newserver-0 status is now: NodeReady
7m42s       Normal    RegisteredNode            node/k3d-newserver-0            Node k3d-newserver-0 event: Registered Node k3d-newserver-0 in Controller

Shortly after, I checked my pods again, and they were rescheduled:

:; kgp -n linkerd -o wide
NAME                                      READY   STATUS        RESTARTS   AGE   IP          NODE                       NOMINATED NODE   READINESS GATES
linkerd-destination-7dd8b9bd89-28wwz      4/4     Running       0          10m   10.42.2.4   k3d-multiserver-server-2   <none>           <none>
linkerd-destination-7dd8b9bd89-4zndw      4/4     Terminating   0          10m   10.42.0.8   k3d-multiserver-server-0   <none>           <none>
linkerd-destination-7dd8b9bd89-bl95t      4/4     Running       0          10m   10.42.1.8   k3d-multiserver-server-1   <none>           <none>
linkerd-destination-7dd8b9bd89-h6cfj      4/4     Running       0          77s   10.42.3.4   k3d-newserver-0            <none>           <none>
linkerd-identity-558c65db5f-77hgf         2/2     Running       0          10m   10.42.2.3   k3d-multiserver-server-2   <none>           <none>
linkerd-identity-558c65db5f-cl6x5         2/2     Running       0          10m   10.42.1.6   k3d-multiserver-server-1   <none>           <none>
linkerd-identity-558c65db5f-q8g9j         2/2     Running       0          77s   10.42.3.7   k3d-newserver-0            <none>           <none>
linkerd-identity-558c65db5f-ws66b         2/2     Terminating   0          10m   10.42.0.6   k3d-multiserver-server-0   <none>           <none>
linkerd-proxy-injector-8684c69f89-9wfzc   2/2     Running       0          10m   10.42.1.7   k3d-multiserver-server-1   <none>           <none>
linkerd-proxy-injector-8684c69f89-dnr4x   2/2     Terminating   0          10m   10.42.0.7   k3d-multiserver-server-0   <none>           <none>
linkerd-proxy-injector-8684c69f89-tjmbc   2/2     Running       0          10m   10.42.2.5   k3d-multiserver-server-2   <none>           <none>
linkerd-proxy-injector-8684c69f89-ztcn9   2/2     Running       0          77s   10.42.3.5   k3d-newserver-0            <none> 

Conclusion: based on the comment (related to pod eviction time), k8s will by default evict pods from failed nodes after a grace period of around 5 minutes (if the default hasn't been changed over the past couple of versions). It took a bit longer than that for me, but Kubernetes as a system has a lot of syncing to do, so it's possible something just took a bit longer than expected.
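
(For context: on newer clusters this window is usually enforced through taint-based eviction rather than the old --pod-eviction-timeout flag. Assuming the defaults haven't been changed, every pod gets NoExecute tolerations of 300 seconds, roughly of this shape:)

tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300  # evicted ~5 minutes after the node goes NotReady
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300

You can confirm these on any pod under .spec.tolerations in the output of kubectl get pod <name> -o yaml.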

Since I wasn't able to reproduce this and my pods were re-scheduled successfully, I can't help but think this isn't really an issue with the way we do HA, but rather with your k8s environment. We made the necessary changes to support HA as suggested by the k8s documentation, i.e. using PDBs and affinity rules. These seem to hold; I had looked at our HA manifests when we talked on Slack and found nothing wrong with them from a logical perspective.
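
The PDB side of that configuration is roughly of this shape (an illustrative sketch with assumed names; the real manifests are rendered from the HA chart values). Note that a PDB only guards voluntary disruptions such as drains; it doesn't change how quickly pods are evicted from a node that dies unexpectedly:

apiVersion: policy/v1beta1  # policy/v1 on newer clusters
kind: PodDisruptionBudget
metadata:
  name: linkerd-destination  # assumed name; one budget per control-plane deployment
  namespace: linkerd
spec:
  maxUnavailable: 1  # allow at most one replica to be voluntarily disrupted
  selector:
    matchLabels:
      linkerd.io/control-plane-component: destination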

I think in your case the environment misbehaved. As I said, we're not making any scheduling decisions ourselves; rather, we provide additional information to aid the scheduler. That information seems to be good. We do not have any control over eviction.

Hope this makes sense. I'll keep this open for a while in case you have any follow-ups.

@tobiasmuehl
Author

Thanks for the detailed response. It makes sense, yet I can't reproduce this correct behavior with clusters that use k3s instead of k3d, even after hours of waiting. The k3s installation isn't customized in any way and strictly follows the docs. At least in my tests, HA works correctly when a human runs the appropriate kubectl delete node command. Can you outline whether linkerd still functions if only 2/3 critical component pods are online? Is there any risk of meshed production traffic being null-routed in such a scenario?

@mateiidavid
Member

mateiidavid commented Nov 30, 2021

@tobiasmuehl I am pretty certain it works, but I wanted to test this out before replying. I managed to get around to this today, and here are my results.

Just as before, I started a 3-node k3d cluster and installed linkerd along with the viz extension, both in HA mode. I checked that everything was running smoothly by opening the dashboard and querying traffic stats on an example application through the viz extension. Finally, after verifying everything, I removed a node using k3d. You can find the commands, in the order I ran them, below:

:; kubectl get nodes
NAME                       STATUS   ROLES                       AGE   VERSION
k3d-multiserver-server-0   Ready    control-plane,etcd,master   15m   v1.21.2+k3s1
k3d-multiserver-server-1   Ready    control-plane,etcd,master   15m   v1.21.2+k3s1
k3d-multiserver-server-2   Ready    control-plane,etcd,master   15m   v1.21.2+k3s1



:; kubectl get pod -n linkerd -o wide
NAME                                      READY   STATUS    RESTARTS   AGE     IP           NODE                       NOMINATED NODE   READINESS GATES
linkerd-destination-fc7fbfcd4-6fnwz       4/4     Running   0          4m25s   10.42.0.9    k3d-multiserver-server-0   <none>           <none>
linkerd-destination-fc7fbfcd4-7t7ww       4/4     Running   0          4m25s   10.42.1.5    k3d-multiserver-server-1   <none>           <none>
linkerd-destination-fc7fbfcd4-mpsxf       4/4     Running   0          4m25s   10.42.2.4    k3d-multiserver-server-2   <none>           <none>
linkerd-identity-84d877868f-cn7wn         2/2     Running   0          4m25s   10.42.1.4    k3d-multiserver-server-1   <none>           <none>
linkerd-identity-84d877868f-f7jwm         2/2     Running   0          4m25s   10.42.2.3    k3d-multiserver-server-2   <none>           <none>
linkerd-identity-84d877868f-xfzqf         2/2     Running   0          4m25s   10.42.0.8    k3d-multiserver-server-0   <none>           <none>
linkerd-proxy-injector-7c8984f66d-67zr4   2/2     Running   0          4m25s   10.42.0.10   k3d-multiserver-server-0   <none>           <none>
linkerd-proxy-injector-7c8984f66d-dphn2   2/2     Running   0          4m25s   10.42.1.6    k3d-multiserver-server-1   <none>           <none>
linkerd-proxy-injector-7c8984f66d-qg4pd   2/2     Running   0          4m25s   10.42.2.5    k3d-multiserver-server-2   <none> 

:; linkerd viz stat deploy -n emojivoto
NAME       MESHED   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   TCP_CONN
emoji         1/1   100.00%   0.5rps           1ms           1ms           1ms          3
vote-bot      1/1   100.00%   0.1rps           1ms           1ms           1ms          1
voting        1/1   100.00%   0.0rps           1ms           1ms           1ms          3
web           1/1   100.00%   0.1rps           1ms           1ms           1ms          3


:; kubectl get nodes
NAME                       STATUS   ROLES                       AGE   VERSION
k3d-multiserver-server-0   Ready    control-plane,etcd,master   15m   v1.21.2+k3s1
k3d-multiserver-server-1   Ready    control-plane,etcd,master   15m   v1.21.2+k3s1
k3d-multiserver-server-2   Ready    control-plane,etcd,master   14m   v1.21.2+k3s1

# Delete random node
:; k3d node delete k3d-multiserver-server-1
INFO[0000] Deleted k3d-multiserver-server-1

# After 2 minutes, k8s updated its state
:; kubectl get nodes
NAME                       STATUS     ROLES                       AGE   VERSION
k3d-multiserver-server-0   Ready      control-plane,etcd,master   16m   v1.21.2+k3s1
k3d-multiserver-server-1   NotReady   control-plane,etcd,master   15m   v1.21.2+k3s1
k3d-multiserver-server-2   Ready      control-plane,etcd,master   15m   v1.21.2+k3s1

Next, I verified that the dashboard was still accessible, which it was. I then went through the viz stat output once more. This time, two of my workloads weren't sending or receiving any traffic; my example app wasn't installed in HA mode, so I had to re-roll the deployments (this doesn't affect whether linkerd functions properly). After the workloads were re-deployed, viz stat successfully gave me back the metrics, despite only 2/3 of the viz workloads being available (some of them were scheduled on the now-defunct node). I also wanted to verify the proxy was functioning, so I got the logs from a random workload pod.

You will notice in the logs that there is a warning message about the proxy failing to connect to a destination endpoint. Consequently, the success rate for the pod also dropped. This is, in my mind, understandable: the proxy still considers an endpoint that can no longer function; however, it does not restrict or have a larger impact on the environment. The control plane still routes traffic to where it needs to go (the proxy resolves all destination endpoints through DNS). In a normal production environment, we'd see a dip in the success rate until the control plane fully recovers -- this is the responsibility of the k8s scheduler, as I mentioned in my previous post.

HA mode should allow everything from the control plane to the viz stack to keep functioning properly in the event that a node goes down. You might see a dip in the success rate while stale endpoints are still considered and not yet cleaned up by Kubernetes -- in my case, k8s didn't deschedule my pods for a while, but again, this is not related to linkerd and there's nothing we can do here -- but everything should still function. As soon as Kubernetes recovered and terminated my pods, the success rate also started recovering. Note: I did not add an extra node.
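
Those eviction delays come from the pods' default NoExecute tolerations mentioned earlier; if faster failover matters, a workload can override them with a smaller tolerationSeconds -- a general Kubernetes knob, not a Linkerd setting (a sketch with illustrative values, set on the Deployment's pod template):

spec:
  template:
    spec:
      tolerations:
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 30  # evict ~30s after the node becomes unreachable
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 30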

# All pods are still marked as running, despite node being down.

:; kgp -n linkerd -o wide
NAME                                      READY   STATUS    RESTARTS   AGE     IP           NODE                       NOMINATED NODE   READINESS GATES
linkerd-destination-fc7fbfcd4-6fnwz       4/4     Running   0          6m22s   10.42.0.9    k3d-multiserver-server-0   <none>           <none>
linkerd-destination-fc7fbfcd4-7t7ww       4/4     Running   0          6m22s   10.42.1.5    k3d-multiserver-server-1   <none>           <none>
linkerd-destination-fc7fbfcd4-mpsxf       4/4     Running   0          6m22s   10.42.2.4    k3d-multiserver-server-2   <none>           <none>
linkerd-identity-84d877868f-cn7wn         2/2     Running   0          6m22s   10.42.1.4    k3d-multiserver-server-1   <none>           <none>
linkerd-identity-84d877868f-f7jwm         2/2     Running   0          6m22s   10.42.2.3    k3d-multiserver-server-2   <none>           <none>
linkerd-identity-84d877868f-xfzqf         2/2     Running   0          6m22s   10.42.0.8    k3d-multiserver-server-0   <none>           <none>
linkerd-proxy-injector-7c8984f66d-67zr4   2/2     Running   0          6m22s   10.42.0.10   k3d-multiserver-server-0   <none>           <none>
linkerd-proxy-injector-7c8984f66d-dphn2   2/2     Running   0          6m22s   10.42.1.6    k3d-multiserver-server-1   <none>           <none>
linkerd-proxy-injector-7c8984f66d-qg4pd   2/2     Running   0          6m22s   10.42.2.5    k3d-multiserver-server-2   <none> 

# Viz stat still works since viz is in HA mode
# but our web pod's success rate dipped
:; linkerd viz stat deploy -n emojivoto
NAME       MESHED   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   TCP_CONN
emoji         1/1   100.00%   0.8rps           1ms           1ms           2ms          3
vote-bot      1/1   100.00%   0.0rps           1ms           1ms           1ms          1
voting        1/1   100.00%   0.5rps           1ms           2ms           3ms          3
web           1/1    87.80%   0.7rps           3ms           5ms           9ms          3

# The address we cannot connect to corresponds to the destination pod on the terminated node
# however, the proxy still continues to work. Evidence comes from metrics collected,
# and TLS.
:; kubectl logs -n emojivoto deploy/web linkerd-proxy -f
Found 2 pods, using pod/web-84bdbd9f-lm8gv
[     0.000910s]  INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[     0.001472s]  INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[     0.001486s]  INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[     0.001491s]  INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[     0.001493s]  INFO ThreadId(01) linkerd2_proxy: Tap DISABLED
[     0.001496s]  INFO ThreadId(01) linkerd2_proxy: Local identity is web.emojivoto.serviceaccount.identity.linkerd.cluster.local
[     0.001499s]  INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.001502s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.010527s]  INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity id=web.emojivoto.serviceaccount.identity.linkerd.cluster.local
[     4.142324s]  WARN ThreadId(01) outbound:server{orig_dst=10.43.221.157:8080}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=10.42.1.5:8086}: linkerd_reconnect: Failed to connect error=connect timed out after 1s

# Pods are finally marked as terminated
:; kubectl get pods -n linkerd -o wide
NAME                                      READY   STATUS        RESTARTS   AGE    IP           NODE                       NOMINATED NODE   READINESS GATES
linkerd-destination-fc7fbfcd4-5rjgn       0/4     Pending       0          8m6s   <none>       <none>                     <none>           <none>
linkerd-destination-fc7fbfcd4-6fnwz       4/4     Running       0          17m    10.42.0.9    k3d-multiserver-server-0   <none>           <none>
linkerd-destination-fc7fbfcd4-7t7ww       4/4     Terminating   0          17m    10.42.1.5    k3d-multiserver-server-1   <none>           <none>
linkerd-destination-fc7fbfcd4-mpsxf       4/4     Running       0          17m    10.42.2.4    k3d-multiserver-server-2   <none>           <none>
linkerd-identity-84d877868f-cn7wn         2/2     Terminating   0          17m    10.42.1.4    k3d-multiserver-server-1   <none>           <none>
linkerd-identity-84d877868f-f4m9d         0/2     Pending       0          8m6s   <none>       <none>                     <none>           <none>
linkerd-identity-84d877868f-f7jwm         2/2     Running       0          17m    10.42.2.3    k3d-multiserver-server-2   <none>           <none>
linkerd-identity-84d877868f-xfzqf         2/2     Running       0          17m    10.42.0.8    k3d-multiserver-server-0   <none>           <none>
linkerd-proxy-injector-7c8984f66d-26k6s   0/2     Pending       0          8m5s   <none>       <none>                     <none>           <none>
linkerd-proxy-injector-7c8984f66d-67zr4   2/2     Running       0          17m    10.42.0.10   k3d-multiserver-server-0   <none>           <none>
linkerd-proxy-injector-7c8984f66d-dphn2   2/2     Terminating   0          17m    10.42.1.6    k3d-multiserver-server-1   <none>           <none>
linkerd-proxy-injector-7c8984f66d-qg4pd   2/2     Running       0          17m    10.42.2.5    k3d-multiserver-server-2   <none> 

With that being said, I did bump into an issue when trying to run linkerd viz stat after the pod was finally terminated by k8s:

Cannot connect to Linkerd Viz: The "tap-f86d97c48-hcsb6" pod is not running
Validate the install with: linkerd viz check

When we try to interact with viz, we do a quick API check to see if we can connect. The check asserts that all pods are live, so in this case it obviously fails. This wasn't caught earlier because the pod hadn't been terminated yet. Even though the pod is now terminated, we should still be able to interact with the metrics API; as a matter of fact, I did exactly that in the snippets above. As an improvement, I think we can relax these checks on viz health when it's running in HA. I'll create an issue for it.

Hope this all makes sense!

Raised #7385 to discuss this further.

@tobiasmuehl
Author

Thanks for your detailed investigation - I mostly understand what you said 😅

Sadly, it seems this issue cannot be replicated with k3d. Perhaps there is a substantial difference in how node removal works on k3s vs k3d.

@mateiidavid
Member

Going to be closing this. There isn't really a lot on our side that can be done. If you do have any more questions, or you'd like to discuss things further, let me know and we can reopen this. 😄

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 18, 2022