
Outbound requests for some downstream services consistently fail after the proxy reported "panicked at 'cancel sender lost'" #8666

Closed
adleong opened this issue Jun 14, 2022 Discussed in #8641 · 16 comments · Fixed by linkerd/linkerd2-proxy#1758

@adleong

adleong commented Jun 14, 2022

Discussed in #8641

Originally posted by DarkWalker June 9, 2022

What's the issue

linkerd-proxy goes into a "broken" state, after which all requests to a specific downstream service within the mesh fail with either an INTERNAL or a CANCELLED error (as observed by the caller). Requests never make it to the target service.

Details about the setup

  • there are a total of 16 services in the mesh with 2 to 60 pods each, the total number of pods in the mesh is 278
  • all the services use gRPC; Linkerd is used to encrypt the traffic, and other than that there is no encryption within the cluster
  • we recently migrated from 2.10 to 2.11.2 and this seems to be related, but other things in the cluster also changed recently, so we rolled everything back and are reintroducing the changes one by one; we will have the same setup on 2.10 in about a week, and then I will be able to tell for sure whether this is specific to 2.11.2 or whether we see a similar issue on 2.10

How the problem manifests itself

  • only some of the services within the mesh are affected, and this has been pretty consistent across all the incidents (we have had 3 so far); on top of that, only certain pods are affected (e.g. for one service 3 out of 3 pods were affected, while for another 9 out of 12 were affected during the first incident, 3 out of 12 during the second, and 5 out of 12 during the third)
  • when pods go into the bad state they do not recover automatically: one of the affected services was not critical, so I left 2 out of 3 pods in the bad state and they stayed that way for more than 24 hours, until we rolled back Linkerd and had to restart all the deployments
  • the mitigation is to delete the pod (the deployment then recreates it); the new pods that come up work just fine
  • the gRPC interceptor on the client side reports CANCELLED and INTERNAL errors
  • this happens only in production; we did a fair amount of load testing prior to the upgrade, and deliberately tried to reproduce the issue after it happened the first time, but there has been no repro so far outside of production load

What we see in linkerd-proxy container logs of affected pods

  1. first, there is an error: thread 'main' panicked at 'cancel sender lost', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/tower-0.4.12/src/ready_cache/cache.rs:415:13 - this message is printed once, when the pod goes into the bad state

  2. then, following the error above, one of two errors appears for every outgoing request from this pod:

  • either INFO ThreadId(01) outbound:server{orig_dst=10.0.0.1:80}:rescue{client.addr=172.16.0.1:35024}: linkerd_app_core::errors::respond: Request failed error=buffer's worker closed unexpectedly, or

  • this one WARN ThreadId(01) outbound:server{orig_dst=10.0.0.1:80}:rescue{client.addr=172.16.0.1:49336}: linkerd_app_outbound::http::server: Unexpected error error=buffered service failed: panic.

  3. buffer's worker closed unexpectedly is also present in the linkerd-proxy logs of unaffected pods, which function just fine, but cancel sender lost only appears on pods when they go into the bad state (a quick way to check for it is sketched below)
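
A quick way to tell the two cases apart is to grep a pod's proxy logs for both messages (a rough sketch; the pod name is a placeholder and the container name is the default linkerd-proxy sidecar):

POD=my-service-abc123   # placeholder pod name
# Pods in the bad state log the panic; pods that are otherwise fine may still
# log the buffer error occasionally.
kubectl logs "$POD" -c linkerd-proxy \
  | grep -E "cancel sender lost|buffer's worker closed unexpectedly"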

Observations

  • Right before each incident, multiple instances of the downstream service start failing their readiness probe (intentionally; this is how we take pods out of rotation when they reach the point of performance degradation under a large amount of traffic). So it seems that right after a bunch of pods start failing the readiness probe, linkerd-proxy on certain pods of services that depend on this one goes into the bad state and does not self-heal after everything is back to reporting "ready".
  • I can't be sure whether there was a moment where all the replicas of that service failed the readiness check (so it's possible there was a brief period of downtime, no more than 10 seconds, with not a single replica available for the service; it may also be true that at any given moment at least one replica was available). Our metrics have a resolution of 30 seconds and all client calls are retried, so the best conclusion I can draw is that the total downtime, if it happened, was no longer than 10 seconds.
  • Out of 7 consumers only 3 were affected, and this has been consistent across all 3 incidents. The non-affected services were either making around 40,000 RPS (with 1,000 RPS from every replica for one service and 3,000 RPS from every replica for another) or didn't make any calls at all when the incidents happened (but made calls less than a minute after, and those calls worked fine); the affected services were making around 50 RPS (with ~4 RPS from every replica).
  • This issue seems to be similar to Outbound requests fail intermittently after the proxy reported "panicked at 'cancel sender lost'" #6086

Attached are logs collected after running curl -v --data 'linkerd=trace,tower=trace' -X PUT localhost:28014/proxy-log-level, covering the span between two buffer's worker closed unexpectedly errors, which are followed by Handling error on gRPC connection code=Internal error - this seems to be what we observe on the application side (the IP addresses as well as some other bits are redacted for security purposes; please let me know if that's a problem).

linkerd_proxy_trace.txt
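
For reference, the log level can be bumped and later restored through the proxy's admin endpoint; this is a sketch assuming the admin port 28014 used above (the default admin port is 4191):

# Raise the proxy log level to trace for linkerd and tower internals.
curl -v -X PUT --data 'linkerd=trace,tower=trace' localhost:28014/proxy-log-level
# ...capture the failing requests, then restore the default level.
curl -v -X PUT --data 'warn,linkerd=info' localhost:28014/proxy-log-level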

I would appreciate any help in debugging this issue, or tips on how to work around it or reproduce it in an isolated environment.

@olix0r

olix0r commented Jun 14, 2022

Related to #6086 tower-rs/tower#415

@Aleksei-Poliakov

Aleksei-Poliakov commented Jun 14, 2022

Hi!

A few updates:

  1. We rolled back to 2.10 and still see the same issue happening.
  2. With the mitigation in place that simply restarts affected pods, we had more than 20 instances of this happening in production today, so if you need more logs or have ideas on what to try, I will likely be able to help.

Other information that might be relevant:

  • we are running on AWS EKS 1.19
  • it looks like 2.10 sometimes fails with a similar but different issue - I will dive into this more tomorrow, but so far sometimes there is no "panicked at 'cancel sender lost'", yet a pod goes into a broken state and gRPC observes UNAVAILABLE errors. Other times the issue is the same one we saw on 2.11.2, and the mitigation below seems to help.

For anyone who stumbles on the issue, here is the code we are using inside cronjob (that runs every minute) to restart affected pods:

# Runs from a cronjob every minute: find fully-ready (2/2) pods whose
# linkerd-proxy logged the panic in the last two minutes and delete them so the
# deployment recreates them.
for pod in `kubectl get pods | grep 2/2 | awk '{print $1}'`; do
  if kubectl logs $pod -c linkerd-proxy --since=2m | grep -q "thread 'main' panicked at 'cancel sender lost'"; then
    kubectl delete pods $pod;
  fi;
done;

@johnswarbrick

johnswarbrick commented Jun 14, 2022

I'm experiencing the exact same issue with both stable-2.11.2 and the latest edge-22.6.1 on Azure AKS.


All Java applications running HTTP/1. Protocol detection seems to be okay:

[protocol-detection screenshot]

I tried adding the container port to the opaque ports list through an annotation but this didn't appear to have much effect.
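
For anyone trying the same thing, this is roughly how the opaque-ports annotation can be added to a workload's pod template (a sketch; the deployment name and port are placeholders):

# The annotation has to land on the pod template, not the Deployment's own
# metadata, so the proxies pick it up when the pods are recreated.
kubectl patch deploy/my-app --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/opaque-ports":"8080"}}}}}'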

I tried enabling trace logs but the output was huge! If anyone can suggest a filter to better target the trace logs, I can give that a try.

@hawkw

hawkw commented Jun 15, 2022

Hi @Aleksei-Poliakov and @johnswarbrick, out of interest, do you know what instance types and AMIs your EKS worker nodes are? Thanks!

@johnswarbrick

Hi @hawkw. We are using Azure AKS. Standard_D8ads_v5 running AKSUbuntu-1804gen2containerd-2022.04.13.

@hawkw

hawkw commented Jun 15, 2022

@johnswarbrick thanks — I must have misread your original post, I thought you were both on EKS. My bad!

@johnswarbrick

Attached are logfiles at debug level.

linkerd-debug-log.txt

@olix0r

olix0r commented Jun 16, 2022

@johnswarbrick I noticed (as described in #8677) that your proxy is logging service discovery updates pretty aggressively. While these shouldn't cause the problems you're seeing, I'd like to understand more about what's causing them. Is there anything interesting in the output of k logs -fn linkerd deploy/linkerd-destination destination?
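
Something like this narrows that output down to a single service (a sketch; the service and namespace names are placeholders):

# Follow the destination controller and keep only the lines that mention the
# service whose endpoints are churning.
kubectl logs -f -n linkerd deploy/linkerd-destination -c destination \
  | grep "my-svc.my-ns"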

@Aleksei-Poliakov

Aleksei-Poliakov commented Jun 16, 2022

@johnswarbrick The original problem happened on AMI 1.19.6-20210526 (this is what's used in production), and I have been running my experiments on 1.19.15-20211206. I will do another round of tests on the AMI that matches the version in production and report back.

@hawkw

hawkw commented Jun 16, 2022

Hi @Aleksei-Poliakov, @johnswarbrick, and others, I've just published a debug build of the proxy with some additional debug logging for these issues. We'd love it if you could help us out by testing out this image and sending us the logs from any proxy instances that hit this crash. When deploying this image, make sure to set the proxy log level to something that includes tower::ready_cache=trace, such as warn,linkerd=debug,tower::ready_cache=trace.

You can use this proxy image by adding these annotations to a workload or namespace:

annotations:
  config.linkerd.io/proxy-image: mycoliza/l2-proxy
  config.linkerd.io/proxy-version: ready-cache-debug.ff808285a
  config.linkerd.io/proxy-log-level: "warn,linkerd=debug,tower::ready_cache=trace"

or, you can run linkerd inject:

$ linkerd inject \
    --proxy-image "mycoliza/l2-proxy" \
    --proxy-version "ready-cache-debug.ff808285a" \
    --proxy-log-level "warn,linkerd=debug,tower::ready_cache=trace" \
    <OTHER ARGS>
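
If you're applying this to an existing workload, linkerd inject reads manifests from a file or stdin, so one way to roll it out looks roughly like this (the deployment name is a placeholder):

# Re-inject an existing deployment with the debug proxy image and re-apply it.
kubectl get deploy/my-app -o yaml \
  | linkerd inject \
      --proxy-image "mycoliza/l2-proxy" \
      --proxy-version "ready-cache-debug.ff808285a" \
      --proxy-log-level "warn,linkerd=debug,tower::ready_cache=trace" \
      - \
  | kubectl apply -f -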

Thanks in advance --- your help debugging this is greatly appreciated!

@olix0r

olix0r commented Jun 17, 2022

We have isolated a similar repro and fixed it in tower-rs/tower#668 and linkerd/linkerd2-proxy#1753. The fix can be tested with:

annotations:
  config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
  config.linkerd.io/proxy-version: ver.tower-rc-fix.b296154c
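
Once pods have been restarted with the annotations in place, a quick way to confirm which proxy image each pod actually picked up (a sketch; the namespace is a placeholder):

# Print the linkerd-proxy image for each pod; patched pods should show the
# ghcr.io/olix0r/l2-proxy image at the ver.tower-rc-fix.b296154c tag.
kubectl get pods -n my-namespace -o \
  jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[?(@.name=="linkerd-proxy")].image}{"\n"}{end}'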

hawkw added a commit to linkerd/linkerd2-proxy that referenced this issue Jun 17, 2022
Tower [v0.4.13] includes a fix for a bug in the `tower::ready_cache`
module, tower-rs/tower#415. The `ready_cache` module is used internally
in Tower's load balancer. This bug resulted in panics in the proxy
(linkerd/linkerd2#8666, linkerd/linkerd2#6086) in cases where the
Destination service sends a very large number of service discovery
updates (see linkerd/linkerd2#8677).

This commit updates the proxy's dependency on `tower` to 0.4.13, to
ensure that this bugfix is picked up.

Fixes linkerd/linkerd2#8666
Fixes linkerd/linkerd2#6086

[v0.4.13]: https://github.com/tower-rs/tower/releases/tag/tower-0.4.13
@johnswarbrick-napier

Hello @olix0r. Thank you for the patch, we will test and report back ASAP!

Does this patch address both the "panicked at 'cancel sender lost'" and the overly aggressive service discovery updates?

@olix0r olix0r reopened this Jun 17, 2022
@olix0r

olix0r commented Jun 17, 2022

@johnswarbrick-napier The proxy change addresses the panics. I believe that the frequent discovery updates made the proxy more likely to trigger the panic, but they shouldn't cause problems once the panic is fixed. We'll probably look into the discovery updates more as we have time, especially if we're able to narrow down a repro.

If you can confirm that the proxy change addresses the bug, we'll probably release a 2.11.3 with the panic fix.

@johnswarbrick

I've applied these annotations:

  config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
  config.linkerd.io/proxy-version: ver.tower-rc-fix.b296154c

...and for the past hour I've not seen any occurrences of either of these log entries:

thread 'main' panicked at 'cancel sender lost', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/tower-0.4.12/src/ready_cache/cache.rs:415:13

linkerd_app_core::serve: Server failed to become ready error=buffered service failed: buffered service failed: task 28182 panicked error.sources=[buffered service failed: task 28182 panicked, task 28182 panicked] client.addr=10.244.3.95:49580

I will continue to monitor for 24 hours and report back, but so far this patch version is looking very promising.
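
For anyone else watching for a regression, a quick check in the same spirit as the cronjob earlier in this thread (a sketch; it assumes the pods are in the current namespace):

# Report any pod whose proxy has logged the panic in the last 10 minutes.
for pod in $(kubectl get pods -o name); do
  kubectl logs "$pod" -c linkerd-proxy --since=10m 2>/dev/null \
    | grep -q "cancel sender lost" && echo "panic seen in $pod"
done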

@johnswarbrick

Hello @olix0r - I can confirm the patch appears to be successful!

I've been running load tests for the past day and everything appears to be working and stable.

If you could push a 2.11.3 release that would be fantastic as it would be an easy upgrade to our Production clusters that have the current stable release.

Thank you so much for your quick help and support.

@olix0r olix0r added this to the stable-2.11.3 milestone Jun 21, 2022
@kleimkuhler kleimkuhler mentioned this issue Jun 30, 2022
@hawkw

hawkw commented Jul 6, 2022

This is fixed in stable-2.11.3.

@hawkw hawkw closed this as completed Jul 6, 2022