
"HTTP Logical service in fail-fast" logs in linkerd-proxy #5599

Closed
hadican opened this issue Jan 25, 2021 · 29 comments

@hadican

hadican commented Jan 25, 2021

Bug Report

In our system, we have 400+ lightweight pods making gRPC requests to 6 powerful pods (each with an Nvidia GPU, running on preemptible nodes) at 3-4 RPS each. After 1-2 minutes of a successful deployment with automatic Linkerd proxy injection via kubectl -n the_namespace get deploy/the_deployment -o yaml | linkerd inject - | kubectl apply -f -, we are seeing hundreds of Failed to proxy request: HTTP Logical service in fail-fast errors. I've checked the source code of the linkerd-proxy repository and found the place where this message is printed, but I couldn't understand the error or the message. I mean, what is an HTTP Logical service? Below you can find a snippet of the logs.

WARN ThreadId(01) inbound:accept{peer.addr=10.37.125.7:40084 target.addr=10.36.216.4:50053}: linkerd_timeout::failfast: HTTP Logical entering failfast after 1s
WARN ThreadId(01) inbound:accept{peer.addr=10.37.126.8:46412 target.addr=10.36.216.4:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
WARN ThreadId(01) inbound:accept{peer.addr=10.37.61.7:37474 target.addr=10.36.216.4:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
WARN ThreadId(01) inbound:accept{peer.addr=10.37.72.4:52626 target.addr=10.36.216.4:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
... (tens of them in same second, hundreds in minute)

I waited for 10-15 minutes, but the system didn't recover, so I reverted and uninjected the proxy from the pods.

In the linkerd-proxy source code, I think there was a comment saying "the inner service is not ready", but the readiness/liveness probes were passing. I thought this might be about the dispatch timeout and wanted to increase it, but it's not configurable (config.rs). I also thought it might be related to all this high load: I installed in HA mode, and nothing changed; I increased both the CPU and memory proxy requests/limits, and again nothing changed.

After all these actions we're still getting fail-fast errors. What is the fail-fast condition? What is "HTTP Logical"? 🧐 How can we resolve this?

linkerd check output:

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

linkerd-ha-checks
-----------------
√ pod injection disabled on kube-system
√ multiple replicas of control plane pods

Status check results are √

Environment

  • Kubernetes: 1.16.15-gke.6000
  • Linkerd Version: stable-2.9.2 with HA
  • Linkerd Proxy Version: edge-21.1.3

Additional context

Let me know if you need further information.

@olix0r
Member

olix0r commented Jan 25, 2021

Thanks @hadican; we've been looking into similar reports of inbound proxies getting stuck in failfast.

It would be helpful to see debug logs from this proxy. You can enable debug logging by adding the following annotation to your pod template: config.linkerd.io/proxy-log-level: linkerd=debug.warn

It may be the case that our connect timeout is too aggressive and the proxy is unable to establish a connection, or this could point to a more subtle bug...

@hawkw
Member

hawkw commented Jan 25, 2021

It would be helpful to see debug logs from this proxy. You can enable debug logging by adding the following annotation to your pod template: config.linkerd.io/proxy-log-level: linkerd=debug.warn

Note that the period between debug and warn should be a comma:

 config.linkerd.io/proxy-log-level: linkerd=debug,warn

Just in case you copied and pasted what @olix0r wrote directly :)

@hadican
Author

hadican commented Jan 25, 2021

Hi @olix0r & @hawkw,

I tried it back then, to see if I could catch something. Here are the logs:

Not sure if I caught the correct ones.

[457.686366s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}: linkerd_app_inbound::require_identity_for_ports: port=50053 peer.id=None(NoTlsFromRemote) id_required=false
[457.686434s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}: linkerd_detect: Detected protocol=Some(H2) elapsed=18.625µs
[457.686461s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}: linkerd_cache: Caching new service target=(Some(H2), TcpAccept { target_addr: 10.37.69.4:50053, client_addr: 10.37.113.5:54104, client_id: None(NoTlsFromRemote) })
[457.686472s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}: linkerd_proxy_http::server: Creating HTTP service version=H2
[457.686518s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}: linkerd_proxy_http::server: Handling as HTTP version=H2
[457.686537s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}: h2::codec::framed_write: send frame=Settings { flags: (0x0), initial_window_size: 65535, max_frame_size: 16384 }
[457.686656s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}:Connection{peer=Server}: h2::codec::framed_read: received frame=Settings { flags: (0x0), enable_push: 0, max_concurrent_streams: 0, initial_window_size: 4194304, max_frame_size: 4194304, max_header_list_size: 8192 }
[457.686672s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}:Connection{peer=Server}: h2::codec::framed_write: send frame=Settings { flags: (0x1: ACK) }
[457.686680s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}:Connection{peer=Server}: h2::codec::framed_read: received frame=WindowUpdate { stream_id: StreamId(0), size_increment: 4128769 }
[457.686698s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}:Connection{peer=Server}: h2::codec::framed_write: send frame=WindowUpdate { stream_id: StreamId(0), size_increment: 983041 }
[457.690754s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}:Connection{peer=Server}: h2::codec::framed_read: received frame=Settings { flags: (0x1: ACK) }
[457.690778s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}:Connection{peer=Server}: h2::proto::settings: received settings ACK; applying Settings { flags: (0x0), initial_window_size: 65535, max_frame_size: 16384 }
[457.690807s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}:Connection{peer=Server}: h2::codec::framed_read: received frame=Headers { stream_id: StreamId(1), flags: (0x4: END_HEADERS) }
[457.690841s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}:Connection{peer=Server}: h2::codec::framed_read: received frame=WindowUpdate { stream_id: StreamId(1), size_increment: 5 }
[457.690938s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.135.6:42924 target.addr=10.37.69.4:50053}:http{v=h2}:http2{name=estimator-service.dl-service.svc.cluster.local:50053}: linkerd_timeout::failfast: HTTP Logical in failfast
[457.690961s] WARN ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}:http{v=h2}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
[457.690982s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}:http{v=h2}: linkerd_app_core::errors: Closing server-side connection
[457.690993s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}:http{v=h2}: linkerd_app_core::errors: Handling error with gRPC status code=Unavailable
[457.691019s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}: linkerd_proxy_http::server: The stack is tearing down the connection
[457.691028s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}:Connection{peer=Server}: h2::codec::framed_write: send frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(2147483647) }
[457.691035s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}:Connection{peer=Server}: h2::codec::framed_write: send frame=Ping { ack: false, payload: [11, 123, 162, 240, 139, 155, 254, 84] }
[457.691075s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}:Connection{peer=Server}: h2::codec::framed_write: send frame=Headers { stream_id: StreamId(1), flags: (0x5: END_HEADERS | END_STREAM) }
[457.691093s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}:Connection{peer=Server}: h2::codec::framed_write: send frame=Reset { stream_id: StreamId(1), error_code: CANCEL }
[457.691180s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}:Connection{peer=Server}: h2::codec::framed_read: received frame=Data { stream_id: StreamId(1) }
[457.691202s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}:Connection{peer=Server}: h2::codec::framed_read: received frame=Data { stream_id: StreamId(1) }
[457.691293s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}:Connection{peer=Server}: h2::codec::framed_read: received frame=Data { stream_id: StreamId(1) }
[457.691437s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}:Connection{peer=Server}: h2::codec::framed_read: received frame=Data { stream_id: StreamId(1) }
[457.691457s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}:Connection{peer=Server}: h2::codec::framed_read: received frame=WindowUpdate { stream_id: StreamId(0), size_increment: 5 }
[457.692138s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}:Connection{peer=Server}: h2::codec::framed_read: received frame=Ping { ack: true, payload: [11, 123, 162, 240, 139, 155, 254, 84] }
[457.692155s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}:Connection{peer=Server}: h2::codec::framed_write: send frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(1) }
[457.692205s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}:Connection{peer=Server}: h2::proto::connection: Connection::poll; connection error error=NO_ERROR
[457.692270s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.113.5:54104 target.addr=10.37.69.4:50053}: linkerd_app_core::serve: Connection closed
[457.692281s] DEBUG ThreadId(01) evict{target=(Some(H2), TcpAccept { target_addr: 10.37.69.4:50053, client_addr: 10.37.113.5:54104, client_id: None(NoTlsFromRemote) })}: linkerd_cache: Awaiting idleness
[457.696414s] DEBUG ThreadId(01) evict{target=(Some(H2), TcpAccept { target_addr: 10.37.69.4:50053, client_addr: 10.36.250.3:58648, client_id: None(NoTlsFromRemote) })}: linkerd_cache: Cache entry dropped

@olix0r
Member

olix0r commented Jan 25, 2021

I've made some changes to the HTTP Logical stack that could potentially address this and, if not, should help us have some clearer diagnostics around this part of the stack. If you can set the following pod annotations and share logs from the first ~30s of the pod's runtime, that may help shed light on the issue:

config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
config.linkerd.io/proxy-version: main.b8c2aa71
config.linkerd.io/proxy-log-level: linkerd_stack::switch_ready=debug,linkerd_reconnect=debug,linkerd_transport::connect=debug,linkerd=info,warn

@hadican
Author

hadican commented Jan 26, 2021

Hi @olix0r,

I've tried your recent build; here are the logs of the proxy container from the beginning:

time="2021-01-26T07:43:09Z" level=info msg="running version edge-21.1.2"
[0.004630s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[0.005859s] INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[0.006047s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[0.006152s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[0.006279s] INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
[0.006381s] INFO ThreadId(01) linkerd2_proxy: Local identity is default.dl-service.serviceaccount.identity.linkerd.cluster.local
[0.006502s] INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[0.006694s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[0.007021s] INFO ThreadId(01) outbound: linkerd_app: listen.addr=127.0.0.1:4140 ingress_mode=false
[0.007683s] INFO ThreadId(01) inbound: linkerd_app: listen.addr=0.0.0.0:4143
[0.065776s] INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity: default.dl-service.serviceaccount.identity.linkerd.cluster.local
[4.587019s] WARN ThreadId(01) inbound:accept{peer.addr=10.37.49.5:42682 target.addr=10.38.191.6:5053}: linkerd_app_core::errors: Failed to proxy request: error trying to connect: Connection refused (os error 111)
[6.845144s] WARN ThreadId(01) inbound:accept{peer.addr=10.37.49.4:46280 target.addr=10.38.191.6:5053}: linkerd_app_core::errors: Failed to proxy request: error trying to connect: Connection refused (os error 111)
[55.452467s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.35.4:60942 target.addr=10.38.191.6:50053}: linkerd_stack::switch_ready: Switching to secondary service after=500ms
[57.285432s] WARN ThreadId(01) inbound:accept{peer.addr=10.37.35.4:60942 target.addr=10.38.191.6:50053}: linkerd_timeout::failfast: HTTP Logical entering failfast after 1s
[57.285692s] WARN ThreadId(01) inbound:accept{peer.addr=10.38.181.2:51676 target.addr=10.38.191.6:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
[57.285753s] WARN ThreadId(01) inbound:accept{peer.addr=10.36.250.3:45168 target.addr=10.38.191.6:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
[57.285777s] WARN ThreadId(01) inbound:accept{peer.addr=10.38.89.5:35396 target.addr=10.38.191.6:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
[57.285796s] WARN ThreadId(01) inbound:accept{peer.addr=10.38.154.5:44686 target.addr=10.38.191.6:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
[57.285811s] WARN ThreadId(01) inbound:accept{peer.addr=10.38.181.3:42148 target.addr=10.38.191.6:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
[57.285839s] WARN ThreadId(01) inbound:accept{peer.addr=10.38.88.5:45122 target.addr=10.38.191.6:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
[57.285857s] WARN ThreadId(01) inbound:accept{peer.addr=10.38.179.5:40120 target.addr=10.38.191.6:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
... (it goes like this)

I waited more than 30 seconds, then for 2-3 minutes, but it didn't recover. I'll check again, but I did not see any error other than "HTTP Logical service in fail-fast".

I injected with:

linkerd inject --proxy-image=ghcr.io/olix0r/l2-proxy --proxy-version=main.b8c2aa71 --proxy-log-level=linkerd_stack::switch_ready=debug,linkerd_reconnect=debug,linkerd_transport::connect=debug,linkerd=info,warn -

Also, one thing I noticed: in the logs the version appears as edge-21.1.2, but the latest is edge-21.1.3, which includes some improvements related to this issue. Did we miss something here? 🤔

One more thing: I think this happens right after a whole pod switch, but I'm not sure; I couldn't track the logs simultaneously. Maybe after the last node terminated, it does not discover the new pods? 🤔

@liuerfire

Hi @hadican, just to offer an idea to help you troubleshoot:
You can check the pod's iptables rules. If you enabled the CNI plugin and didn't set --use-wait-flag, it may result in broken iptables rules.

@olix0r
Member

olix0r commented Jan 26, 2021

Thanks. Looking at this, I realize I mis-specified the log configuration. It would be great if you could try again with:

--proxy-log-level=linkerd_stack::switch_ready=debug,linkerd_reconnect=debug,linkerd_proxy=debug,linkerd=info,warn

It's curious to me that we don't see any logging from the reconnect module, though we see connection refused errors.

It's also interesting that we see attempts at 4s, 6s, and then at 55s -- there's a 50s gap there. What I'm trying to understand from the logs is whether we're trying to connect or whether there's something blocking the reconnect logic from being applied.

Also, one thing I noticed: in the logs the version appears as edge-21.1.2, but the latest is edge-21.1.3, which includes some improvements related to this issue. Did we miss something here? 🤔

No, that's fine. When we build development proxy images, the base image isn't always the latest; but since we're replacing the proxy binary, it's fine. The base image provides the start script that generates a private key, etc, that we use to bootstrap identity.

Thanks for your help tracking this down!

@olix0r olix0r added this to the stable-2.10 milestone Jan 26, 2021
@hadican
Author

hadican commented Jan 26, 2021

Hi @olix0r,

Here is the log of start:

2021-01-26 22:46:27.327 EET time="2021-01-26T19:46:27Z" level=info msg="running version edge-21.1.2"
2021-01-26 22:46:27.386 EET [ 0.004028s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
2021-01-26 22:46:27.387 EET [ 0.005291s] INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
2021-01-26 22:46:27.387 EET [ 0.005466s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
2021-01-26 22:46:27.387 EET [ 0.005555s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
2021-01-26 22:46:27.387 EET [ 0.005641s] INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
2021-01-26 22:46:27.388 EET [ 0.005729s] INFO ThreadId(01) linkerd2_proxy: Local identity is default.dl-service.serviceaccount.identity.linkerd.cluster.local
2021-01-26 22:46:27.388 EET [ 0.005815s] INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
2021-01-26 22:46:27.388 EET [ 0.005901s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
2021-01-26 22:46:27.388 EET [ 0.006234s] INFO ThreadId(01) outbound: linkerd_app: listen.addr=127.0.0.1:4140 ingress_mode=false
2021-01-26 22:46:27.389 EET [ 0.006706s] INFO ThreadId(01) inbound: linkerd_app: listen.addr=0.0.0.0:4143
2021-01-26 22:46:27.420 EET [ 0.038169s] DEBUG ThreadId(01) identity: linkerd_proxy_transport::connect: Connecting peer.addr=10.36.204.14:8080
2021-01-26 22:46:27.420 EET [ 0.038305s] DEBUG ThreadId(01) identity: linkerd_proxy_transport::connect: Connecting peer.addr=10.37.49.12:8080
2021-01-26 22:46:27.420 EET [ 0.038351s] DEBUG ThreadId(01) identity: linkerd_proxy_transport::connect: Connecting peer.addr=10.37.51.14:8080
[ 0.039686s] DEBUG ThreadId(01) identity:h2: linkerd_proxy_transport::connect: Connected local.addr=10.38.151.4:59488 keepalive=Some(10s)
2021-01-26 22:46:27.422 EET [ 0.040049s] DEBUG ThreadId(01) identity:h2: linkerd_proxy_transport::connect: Connected local.addr=10.38.151.4:50234 keepalive=Some(10s)
2021-01-26 22:46:27.423 EET [ 0.041237s] DEBUG ThreadId(01) identity:h2: linkerd_proxy_transport::connect: Connected local.addr=10.38.151.4:38784 keepalive=Some(10s)
2021-01-26 22:46:27.437 EET [ 0.054729s] DEBUG ThreadId(02) daemon:identity: linkerd_proxy_identity::certify: daemon certified until SystemTime { tv_sec: 1611776807, tv_nsec: 0 }
2021-01-26 22:46:27.437 EET [ 0.054839s] INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity: default.dl-service.serviceaccount.identity.linkerd.cluster.local
2021-01-26 22:46:29.819 EET [ 2.437080s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.37.49.4:55846 target.addr=10.38.151.4:4191}: linkerd_proxy_http::server: Creating HTTP service version=Http1
2021-01-26 22:46:29.819 EET [ 2.437358s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.37.49.4:55846 target.addr=10.38.151.4:4191}: linkerd_proxy_http::server: Handling as HTTP version=Http1
2021-01-26 22:46:29.997 EET [ 2.614605s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.49.4:55096 target.addr=10.38.151.4:5053}: linkerd_proxy_http::server: Creating HTTP service version=Http1
2021-01-26 22:46:29.997 EET [ 2.614984s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.49.4:55096 target.addr=10.38.151.4:5053}: linkerd_proxy_http::server: Handling as HTTP version=Http1
2021-01-26 22:46:29.997 EET [ 2.615454s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.49.4:55096 target.addr=10.38.151.4:5053}: linkerd_proxy_http::client: Building HTTP client settings=Http1
2021-01-26 22:46:29.997 EET [ 2.615625s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.49.4:55096 target.addr=10.38.151.4:5053}:http1: linkerd_proxy_http::client: method=GET uri=http://10.38.151.4:5053/metrics version=HTTP/1.1
2021-01-26 22:46:29.998 EET [ 2.615712s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.49.4:55096 target.addr=10.38.151.4:5053}:http1: linkerd_proxy_http::client: headers={"host": "10.38.151.4:5053", "user-agent": "Prometheus/2.11.0", "accept": "application/openmetrics-text; version=0.0.1,text/plain;version=0.0.4;q=0.5,*/*;q=0.1", "accept-encoding": "gzip", "x-prometheus-scrape-timeout-seconds": "10.000000"}
2021-01-26 22:46:29.998 EET [ 2.615814s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.49.4:55096 target.addr=10.38.151.4:5053}:http1: linkerd_proxy_http::h1: Caching new client use_absolute_form=false
2021-01-26 22:46:29.998 EET [ 2.616033s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.49.4:55096 target.addr=10.38.151.4:5053}:http1: linkerd_proxy_transport::connect: Connecting peer.addr=127.0.0.1:5053
2021-01-26 22:46:29.998 EET [ 2.616383s] WARN ThreadId(01) inbound:accept{peer.addr=10.37.49.4:55096 target.addr=10.38.151.4:5053}: linkerd_app_core::errors: Failed to proxy request: error trying to connect: Connection refused (os error 111)
2021-01-26 22:46:29.998 EET [ 2.616633s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.49.4:55096 target.addr=10.38.151.4:5053}: linkerd_proxy_http::server: The stack is tearing down the connection
2021-01-26 22:46:32.922 EET [ 5.540138s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.38.151.4:33432 target.addr=169.254.169.254:80}: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
2021-01-26 22:46:32.942 EET [ 5.560364s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.38.151.4:33434 target.addr=169.254.169.254:80}: linkerd_proxy_http::server: Handling as HTTP version=Http1
2021-01-26 22:46:32.942 EET [ 5.560428s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.38.151.4:33422 target.addr=169.254.169.254:80}:http1: linkerd_proxy_http::client: method=GET uri=http://metadata/computeMetadata/v1/instance/attributes/cluster-name version=HTTP/1.1
2021-01-26 22:46:32.942 EET [ 5.560437s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.38.151.4:33422 target.addr=169.254.169.254:80}:http1: linkerd_proxy_http::client: headers={"host": "metadata", "user-agent": "python-requests/2.25.1", "accept-encoding": "gzip, deflate", "accept": "*/*", "metadata-flavor": "Google"}
2021-01-26 22:46:32.942 EET [ 5.560476s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.38.151.4:33422 target.addr=169.254.169.254:80}:http1: linkerd_proxy_transport::connect: Connecting peer.addr=169.254.169.254:80
2021-01-26 22:46:32.943 EET [ 5.560745s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.38.151.4:33422 target.addr=169.254.169.254:80}:http1: linkerd_proxy_transport::connect: Connected local.addr=10.38.151.4:33436 keepalive=Some(10s)
2021-01-26 22:46:32.943 EET [ 5.560763s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.38.151.4:33422 target.addr=169.254.169.254:80}:http1: linkerd_proxy_transport::metrics: client connection open
2021-01-26 22:46:32.944 EET [ 5.562364s] DEBUG ThreadId(01) outbound:accept{peer.addr=10.38.151.4:33434 target.addr=169.254.169.254:80}: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
2021-01-26 22:46:34.448 EET [ 7.066463s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.37.51.10:46342 target.addr=10.38.151.4:4191}: linkerd_proxy_http::server: Creating HTTP service version=Http1
2021-01-26 22:46:34.448 EET [ 7.066501s] DEBUG ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{peer.addr=10.37.51.10:46342 target.addr=10.38.151.4:4191}: linkerd_proxy_http::server: Handling as HTTP version=Http1
2021-01-26 22:46:35.120 EET [ 7.737769s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.52.2:55812 target.addr=10.38.151.4:5053}: linkerd_proxy_http::server: Creating HTTP service version=Http1
2021-01-26 22:46:35.120 EET [ 7.737808s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.52.2:55812 target.addr=10.38.151.4:5053}: linkerd_proxy_http::server: Handling as HTTP version=Http1
2021-01-26 22:46:35.120 EET [ 7.737931s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.49.4:55096 target.addr=10.38.151.4:5053}:http1: linkerd_proxy_http::client: method=GET uri=http://10.38.151.4:5053/metrics version=HTTP/1.1
2021-01-26 22:46:35.120 EET [ 7.737940s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.49.4:55096 target.addr=10.38.151.4:5053}:http1: linkerd_proxy_http::client: headers={"host": "10.38.151.4:5053", "user-agent": "Prometheus/2.11.0", "accept": "application/openmetrics-text; version=0.0.1,text/plain;version=0.0.4;q=0.5,*/*;q=0.1", "accept-encoding": "gzip", "x-prometheus-scrape-timeout-seconds": "10.000000"}
2021-01-26 22:46:35.120 EET [ 7.737981s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.49.4:55096 target.addr=10.38.151.4:5053}:http1: linkerd_proxy_transport::connect: Connecting peer.addr=127.0.0.1:5053
2021-01-26 22:46:35.120 EET [ 7.738102s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.49.4:55096 target.addr=10.38.151.4:5053}:http1: linkerd_proxy_transport::connect: Connected local.addr=127.0.0.1:49512 keepalive=None
2021-01-26 22:46:35.120 EET [ 7.738111s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.49.4:55096 target.addr=10.38.151.4:5053}:http1: linkerd_proxy_transport::metrics: client connection open
2021-01-26 22:46:36.102 EET [ 8.720027s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.49.5:50520 target.addr=10.38.151.4:5053}: linkerd_proxy_http::server: Creating HTTP service version=Http1
2021-01-26 22:46:36.102 EET [ 8.720067s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.49.5:50520 target.addr=10.38.151.4:5053}: linkerd_proxy_http::server: Handling as HTTP version=Http1
2021-01-26 22:46:36.102 E

After a while...

2021-01-26 22:47:19.435 EET [52.053317s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.38.174.5:55588 target.addr=10.38.151.4:50053}: linkerd_proxy_http::server: Handling as HTTP version=H2
2021-01-26 22:47:19.474 EET [52.091611s] DEBUG ThreadId(01) dst: linkerd_proxy_transport::connect: Connecting peer.addr=10.37.49.13:8086
2021-01-26 22:47:19.474 EET [52.091694s] DEBUG ThreadId(01) dst: linkerd_proxy_transport::connect: Connecting peer.addr=10.37.51.16:8086
2021-01-26 22:47:19.474 EET [52.091744s] DEBUG ThreadId(01) dst: linkerd_proxy_transport::connect: Connecting peer.addr=10.37.52.21:8086
2021-01-26 22:47:19.475 EET [52.092726s] DEBUG ThreadId(01) dst:h2: linkerd_proxy_transport::connect: Connected local.addr=10.38.151.4:39810 keepalive=Some(10s)
2021-01-26 22:47:19.475 EET [52.092899s] DEBUG ThreadId(01) dst:h2: linkerd_proxy_transport::connect: Connected local.addr=10.38.151.4:41034 keepalive=Some(10s)
2021-01-26 22:47:19.475 EET [52.093043s] DEBUG ThreadId(01) dst:h2: linkerd_proxy_transport::connect: Connected local.addr=10.38.151.4:42026 keepalive=Some(10s)
2021-01-26 22:47:19.479 EET [52.097214s] DEBUG ThreadId(01) linkerd_proxy_http::client: Building HTTP client settings=H2
2021-01-26 22:47:19.479 EET [52.097232s] DEBUG ThreadId(01) linkerd_proxy_transport::connect: Connecting peer.addr=127.0.0.1:50053
2021-01-26 22:47:19.479 EET [52.097536s] DEBUG ThreadId(01) h2: linkerd_proxy_transport::connect: Connected local.addr=127.0.0.1:46886 keepalive=None
2021-01-26 22:47:19.479 EET [52.097552s] DEBUG ThreadId(01) h2: linkerd_proxy_transport::metrics: client connection open
2021-01-26 22:47:19.480 EET [52.097778s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.38.174.5:55588 target.addr=10.38.151.4:50053}:h2: linkerd_proxy_http::client: method=POST uri=http://estimator-service.dl-service.svc.cluster.local:50053/estimation/DetectSingle version=HTTP/2.0
2021-01-26 22:47:19.480 EET [52.097794s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.38.174.5:55588 target.addr=10.38.151.4:50053}:h2: linkerd_proxy_http::client: headers={"te": "trailers", "content-type": "application/grpc", "user-agent": "grpc-python-asyncio/1.34.0 grpc-c/14.0.0 (linux; chttp2)", "grpc-accept-encoding": "identity,deflate,gzip", "accept-encoding": "identity,gzip", "grpc-timeout": "2S"}
2021-01-26 22:47:19.680 EET [52.297683s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.38.133.2:38332 target.addr=10.38.151.4:50053}: linkerd_proxy_http::server: Creating HTTP service version=H2
2021-01-26 22:47:19.680 EET [52.297904s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.38.133.2:38332 target.addr=10.38.151.4:50053}: linkerd_proxy_http::server: Handling as HTTP version=H2

It was working fine, then went into fail-fast. These logs are after 1-2 mins:

2021-01-26 22:52:13.245 EET [345.863003s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.38.174.5:55588 target.addr=10.38.151.4:50053}:h2: linkerd_proxy_http::client: method=POST uri=http://estimator-service.dl-service.svc.cluster.local:50053/estimation/DetectSingle version=HTTP/2.0
2021-01-26 22:52:13.245 EET [345.863183s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.38.174.5:55588 target.addr=10.38.151.4:50053}:h2: linkerd_proxy_http::client: headers={"te": "trailers", "content-type": "application/grpc", "user-agent": "grpc-python-asyncio/1.34.0 grpc-c/14.0.0 (linux; chttp2)", "grpc-accept-encoding": "identity,deflate,gzip", "accept-encoding": "identity,gzip", "grpc-timeout": "2010m"}
2021-01-26 22:52:13.317 EET [345.934843s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.38.174.5:55588 target.addr=10.38.151.4:50053}:h2: linkerd_proxy_http::client: method=POST uri=http://estimator-service.dl-service.svc.cluster.local:50053/estimation/DetectSingle version=HTTP/2.0
2021-01-26 22:52:13.317 EET [345.935118s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.38.174.5:55588 target.addr=10.38.151.4:50053}:h2: linkerd_proxy_http::client: headers={"te": "trailers", "content-type": "application/grpc", "user-agent": "grpc-python-asyncio/1.34.0 grpc-c/14.0.0 (linux; chttp2)", "grpc-accept-encoding": "identity,deflate,gzip", "accept-encoding": "identity,gzip", "grpc-timeout": "2S"}
2021-01-26 22:52:13.337 EET [345.955219s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.38.174.5:55588 target.addr=10.38.151.4:50053}:h2: linkerd_proxy_http::client: method=POST uri=http://estimator-service.dl-service.svc.cluster.local:50053/estimation/DetectSingle version=HTTP/2.0
2021-01-26 22:52:13.337 EET [345.955480s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.38.174.5:55588 target.addr=10.38.151.4:50053}:h2: linkerd_proxy_http::client: headers={"te": "trailers", "content-type": "application/grpc", "user-agent": "grpc-python-asyncio/1.34.0 grpc-c/14.0.0 (linux; chttp2)", "grpc-accept-encoding": "identity,deflate,gzip", "accept-encoding": "identity,gzip", "grpc-timeout": "2S"}
2021-01-26 22:52:13.369 EET [345.986858s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.38.174.5:55588 target.addr=10.38.151.4:50053}:h2: linkerd_proxy_http::client: method=POST uri=http://estimator-service.dl-service.svc.cluster.local:50053/estimation/DetectSingle version=HTTP/2.0
2021-01-26 22:52:13.369 EET [345.987176s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.38.174.5:55588 target.addr=10.38.151.4:50053}:h2: linkerd_proxy_http::client: headers={"te": "trailers", "content-type": "application/grpc", "user-agent": "grpc-python-asyncio/1.34.0 grpc-c/14.0.0 (linux; chttp2)", "grpc-accept-encoding": "identity,deflate,gzip", "accept-encoding": "identity,gzip", "grpc-timeout": "2S"}
2021-01-26 22:52:13.449 EET [346.066949s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.38.174.5:55588 target.addr=10.38.151.4:50053}:h2: linkerd_proxy_http::client: method=POST uri=http://estimator-service.dl-service.svc.cluster.local:50053/estimation/DetectSingle version=HTTP/2.0
2021-01-26 22:52:13.449 EET [346.067198s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.38.174.5:55588 target.addr=10.38.151.4:50053}:h2: linkerd_proxy_http::client: headers={"te": "trailers", "content-type": "application/grpc", "user-agent": "grpc-python-asyncio/1.34.0 grpc-c/14.0.0 (linux; chttp2)", "grpc-accept-encoding": "identity,deflate,gzip", "accept-encoding": "identity,gzip", "grpc-timeout": "2010m"}
2021-01-26 22:52:14.468 EET [347.084976s] WARN ThreadId(01) inbound:accept{peer.addr=10.38.174.5:55588 target.addr=10.38.151.4:50053}: linkerd_timeout::failfast: HTTP Logical entering failfast after 1s
2021-01-26 22:52:14.468 EET [347.085066s] WARN ThreadId(01) inbound:accept{peer.addr=10.38.130.5:45496 target.addr=10.38.151.4:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
2021-01-26 22:52:14.468 EET [347.085102s] WARN ThreadId(01) inbound:accept{peer.addr=10.37.215.2:58980 target.addr=10.38.151.4:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
2021-01-26 22:52:14.468 EET [347.085119s] WARN ThreadId(01) inbound:accept{peer.addr=10.38.75.4:35400 target.addr=10.38.151.4:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
2021-01-26 22:52:14.468 EET [347.085131s] WARN ThreadId(01) inbound:accept{peer.addr=10.37.215.6:48374 target.addr=10.38.151.4:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
2021-01-26 22:52:14.468 EET [347.085145s] WARN ThreadId(01) inbound:accept{peer.addr=10.38.201.6:47102 target.addr=10.38.151.4:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
2021-01-26 22:52:14.468 EET [347.085170s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.38.130.5:45496 target.addr=10.38.151.4:50053}: linkerd_proxy_http::server: The stack is tearing down the connection
2021-01-26 22:52:14.468 EET [347.085237s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.215.2:58980 target.addr=10.38.151.4:50053}: linkerd_proxy_http::server: The stack is tearing down the connection
2021-01-26 22:52:14.468 EET [347.085274s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.38.75.4:35400 target.addr=10.38.151.4:50053}: linkerd_proxy_http::server: The stack is tearing down the connection
2021-01-26 22:52:14.468 EET [347.085341s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.215.6:48374 target.addr=10.38.151.4:50053}: linkerd_proxy_http::server: The stack is tearing down the connection
2021-01-26 22:52:14.468 EET [347.085361s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.38.201.6:47102 target.addr=10.38.151.4:50053}: linkerd_proxy_http::server: The stack is tearing down the connection
2021-01-26 22:52:14.468 EET [347.085408s] WARN ThreadId(01) inbound:accept{peer.addr=10.38.193.4:60734 target.addr=10.38.151.4:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
2021-01-26 22:52:14.468 EET [347.085426s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.38.193.4:60734 target.addr=10.38.151.4:50053}: linkerd_proxy_http::server: The stack is tearing down the connection
2021-01-26 22:52:14.551 EET [347.169512s] WARN ThreadId(01) inbound:accept{peer.addr=10.37.167.5:48602 target.addr=10.38.151.4:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
2021-01-26 22:52:14.552 EET [347.169572s] DEBUG ThreadId(01) inbound:accept{peer.addr=10.37.167.5:48602 target.addr=10.38.151.4:50053}: linkerd_proxy_http::server: The stack is tearing down the connection
2021-01-26 22:52:14.589 EET [347.207153s] WARN ThreadId(01) inbound:accept{peer.addr=10.38.238.3:38524 target.addr=10.38.151.4:50053}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast

@hadican
Author

hadican commented Jan 26, 2021

Hi @liuerfire, thanks for the tip. We don't use the CNI plugin. However, do you have any other tips? 🤔

@olix0r
Member

olix0r commented Jan 26, 2021

@hadican Thanks, this is helpful. Just for the sake of ruling out configuration issues: is it possible that your gRPC server is setting a stream concurrency limit?

@hadican
Author

hadican commented Jan 27, 2021

Hi @olix0r,

I believe it's 10. ('grpc.max_concurrent_streams', 10)

@hadican
Author

hadican commented Feb 1, 2021

Hi @olix0r, any updates on this issue? 😔

@olix0r
Member

olix0r commented Feb 1, 2021

@hadican We've recently spent some time trying to reproduce this using demo apps and we've been unable to trigger this behavior. Specifically, we modified our server's container to block for 2 minutes before starting the application. The proxy encounters connection refused errors and goes into failfast. However, once the application starts, we leave the failfast state and process requests as expected.

You mention above that you have max-concurrent-streams configured to 10. Is it possible that you have 10 in-flight requests and the proxy is unable to send additional requests? Are you able to increase this limit? If you do, does that change the behavior?

@karsten42

Coming here from #5183.
We were experiencing fail-fast errors on the server-side proxy when max-concurrent-streams was set.

Now we are seeing the Failed to proxy request: HTTP Logical service in fail-fast errors on another service which definitely doesn't have this configuration. ^^

A bit of context to the service in question:

  • it takes about 30 minutes to actually boot up; during that time the HTTP liveness probes and Prometheus metrics calls fail with connection refused because the server is not listening yet. These are eventually served by a normal HTTP server listening on 8082.
  • the service serves the same gRPC service on two different ports: 8081 (no TLS) and 8084 (TLS using self-signed certs). The reason for this is the need for certificates when connecting a gcloud ingress to a gRPC service. The second port (8084) is not used at all at the moment, but it is exposed in the deployment.

I used the configuration @olix0r suggested here. The proxy image we are using is: ghcr.io/olix0r/l2-proxy:main.b8c2aa71.

I put the logs from the startup of the service and after it fully booted up in this gist.

One log message caught my eye, which is: linkerd_stack::switch_ready: Switching to secondary service after=500ms
What does that mean? I couldn't find anything in the repo.

We currently have a version of this service running without any issues; it has only one port serving gRPC (8081, no TLS) and runs the proxy image ghcr.io/linkerd/proxy:stable-2.9.0.

Hope this helps.

@hawkw
Member

hawkw commented Feb 2, 2021

One log message caught my eye, which is: linkerd_stack::switch_ready: Switching to secondary service after=500ms
What does that mean? I couldn't find anything in the repo.

This means that the proxy's client for talking to the injected service didn't become ready (i.e. the proxy wasn't able to connect to the service) after 500 ms, and so the proxy switched to eagerly failing requests until the client connection to the service becomes usable. It's emitted here: https://github.com/linkerd/linkerd2-proxy/blob/b8c2aa7168d93eda3ea5f2e3b7fc6a9910cf0c4f/linkerd/stack/src/switch_ready.rs#L127

@hawkw
Member

hawkw commented Feb 2, 2021

I've been trying to reproduce this using a simple demo client/service, and varying the amount of time the demo service waits before accepting connections, the latency distribution of the demo service, the number of clients, and the gRPC per-connection concurrency limit on the server. So far, I'm still not having any luck — the proxy still comes out of fail-fast as soon as the demo service comes up.

I have a feeling there's probably some missing piece involved that my test setup isn't capturing, but I'm not sure what it is. @karsten42 and @hadican, can you think of any other details we might be missing?

@hadican
Author

hadican commented Feb 3, 2021

@olix0r @hawkw We have updated max_concurrent_streams from 10 to 100. No more fail-fast errors. 🎉

Thanks for the tips and help!

@hawkw
Member

hawkw commented Feb 3, 2021

@hadican thanks — that's also useful to know as we try to narrow down what's going on in the proxy. Do you have a rough estimate of the request load on the service?

@karsten42

@hawkw I played around some more with the setup I described here.
The only way I can reliably reproduce this is by having three ports serving the same gRPC service and a startup process that takes 30 minutes.
So the port setup is:

  • 8081: gRPC server without TLS
  • 8082: HTTP server serving metrics, also used for the k8s readiness probes
  • 8084: gRPC service with TLS (self-signed)
  • 8085: gRPC server without TLS (backup to 8081), ignored by Linkerd using config.linkerd.io/skip-inbound-ports. We wanted to use this to quickly switch between Linkerd and non-Linkerd so we don't have to restart the pod (which takes 30 minutes)

I have also seen this issue when port 8085 wasn't active, but not as reliably as with the setup just described.

@hawkw
Member

hawkw commented Feb 4, 2021

@olix0r @hawkw We have updated max_concurrent_streams from 10 to 100. No more fail-fast errors. 🎉

Thanks for the tips and help!

To clarify, is this the max_concurrent_streams setting on the gRPC client, or the gRPC server? Thanks!

@karsten42

@olix0r @hawkw If it would help, we could also do a debug session where you can take a look at the live pods that show this issue.

@fpetkovski
Contributor

fpetkovski commented Feb 9, 2021

We have a similar problem where two pods that talk to each other over HTTP are in the service mesh. We get intermittent 503s 3-5 minutes after deploying them on the mesh. Here are the logs from the pod initiating the connection:

2021-02-09 15:47:15.440 [  1225.987350s]  INFO ThreadId(01) outbound:accept{peer.addr=10.250.166.139:56588 target.addr=172.20.187.110:80}: linkerd_timeout::failfast: HTTP Logical service has recovered
2021-02-09 15:47:11.764 [  1222.311160s]  WARN ThreadId(01) outbound:accept{peer.addr=10.250.166.139:56588 target.addr=172.20.187.110:80}: linkerd_timeout::failfast: HTTP Logical entering failfast after 3s
2021-02-09 15:47:11.764 [  1222.311240s]  WARN ThreadId(01) outbound:accept{peer.addr=10.250.166.139:56588 target.addr=172.20.187.110:80}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
2021-02-09 15:47:11.230 [  1221.777336s]  INFO ThreadId(01) outbound:accept{peer.addr=10.250.166.139:46098 target.addr=172.20.35.45:80}: linkerd_timeout::failfast: HTTP Logical service has recovered
2021-02-09 15:47:10.733 [  1221.280769s]  WARN ThreadId(01) outbound:accept{peer.addr=10.250.166.139:46554 target.addr=172.20.35.45:80}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
2021-02-09 15:47:10.732 [  1221.278949s]  INFO ThreadId(01) outbound:accept{peer.addr=10.250.166.139:46098 target.addr=172.20.35.45:80}: linkerd_app_core::serve: Connection closed error=connection closed before message completed
2021-02-09 15:47:10.732 [  1221.278998s]  WARN ThreadId(01) outbound:accept{peer.addr=10.250.166.139:46506 target.addr=172.20.35.45:80}: linkerd_app_core::errors: Failed to proxy request: HTTP Logical service in fail-fast
2021-02-09 15:47:10.731 [  1221.278801s]  WARN ThreadId(01) outbound:accept{peer.addr=10.250.166.139:46098 target.addr=172.20.35.45:80}: linkerd_timeout::failfast: HTTP Logical entering failfast after 3s

What is a logical service and how does the proxy decide to put it in failfast?

Also, this only happens when the client is also in the mesh. If we take the client out of the mesh, we see zero 503s.

@hadican
Author

hadican commented Feb 11, 2021

Hi @hawkw, we set the "max_concurrent_streams" config on the server side. Each pod handles around 20 RPS.

@fpetkovski
Contributor

fpetkovski commented Feb 11, 2021

Our issue was caused by #5684.

Might be worth checking the latency of linkerd-destination when these errors happen to see whether the root cause is the same.

@svilensabev

I hit this error while checking the multi-cluster capabilities of Linkerd. I also spotted the proxy max-concurrency exhausted error from the related issue in the logs. The traffic splits didn't work.

For me it happened during a test flight of this great demo: https://github.com/olix0r/l2-k3d-multi

I made a few runs and investigated a bit with proxy-log-level: trace, with no success (my proxy debugging skills are limited).

Then I read the docs on exporting services and realized that the annotations used for export were wrong. The demo repo uses the ones below: https://github.com/olix0r/l2-k3d-multi/blob/master/base/export_service.yml

- op: add
  path: /metadata/annotations
  value:
    mirror.linkerd.io/gateway-name: linkerd-gateway
    mirror.linkerd.io/gateway-ns: linkerd-multicluster

while the docs state that it needs the label mirror.linkerd.io/exported: "true" (https://linkerd.io/2/tasks/multicluster/#exporting-the-services).
So I made this change:

- op: add
  path: /metadata/labels
  value:
    mirror.linkerd.io/exported: "true"

and I was able to get the multicluster demo working with traffic splits. 🛰️


linkerd version
Client version: edge-21.2.2
Server version: edge-21.2.2

Hope this helps in investigation.

@olix0r
Member

olix0r commented Feb 17, 2021

Thanks @svilensabev -- I'll make sure we update that repo to use the newer labels!

@adleong
Member

adleong commented Mar 13, 2021

@hadican I'm going to close this issue due to inactivity for now. If you're still experiencing this, I recommend trying stable-2.10.0 or the latest edge release and re-opening this issue if you still see the problem.

@adleong adleong closed this as completed Mar 13, 2021
@pepsi1k

pepsi1k commented Mar 19, 2021

@adleong I use stable-2.10.0 and I see the same logs on all linkerd-proxy containers:

[bogdan@admin scripts]$ k logs linkerd-destination-5694c5787c-pm27w linkerd-proxy
time="2021-03-19T13:37:56Z" level=info msg="running version stable-2.10.0"
[     0.002365s]  INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[     0.003148s]  INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[     0.003166s]  INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[     0.003171s]  INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[     0.003174s]  INFO ThreadId(01) linkerd2_proxy: Tap DISABLED
[     0.003177s]  INFO ThreadId(01) linkerd2_proxy: Local identity is linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
[     0.003181s]  INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.003203s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via localhost:8086
[     0.005384s]  WARN ThreadId(01) daemon:identity: linkerd_app_core::control: Failed to resolve control-plane component error=no record found for name: linkerd-identity-headless.linkerd.svc.cluster.local. type: SRV class: IN
[     0.007913s]  WARN ThreadId(01) daemon:identity: linkerd_app_core::control: Failed to resolve control-plane component error=no record found for name: linkerd-identity-headless.linkerd.svc.cluster.local. type: SRV class: IN
[     1.990255s]  WARN ThreadId(01) inbound:accept{client.addr=10.112.49.169:44714 target.addr=10.112.57.233:9996}: linkerd_app_core::errors: Failed to proxy request: error trying to connect: Connection refused (os error 111)
[     2.522311s]  INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}:accept{client.addr=127.0.0.1:35320 target.addr=127.0.0.1:4191}: linkerd_app_core::serve: Connection closed error=TLS detection timeout
[     5.008077s]  WARN ThreadId(01) daemon:identity: linkerd_app_core::control: Failed to resolve control-plane component error=no record found for name: linkerd-identity-headless.linkerd.svc.cluster.local. type: SRV class: IN
[     5.139031s]  WARN ThreadId(01) inbound:accept{client.addr=10.112.49.169:44732 target.addr=10.112.57.233:9996}: linkerd_app_core::errors: Failed to proxy request: error trying to connect: Connection refused (os error 111)
[    10.517323s]  INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity: linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
[    15.020669s]  WARN ThreadId(01) inbound:accept{client.addr=10.112.66.72:55734 target.addr=10.112.57.233:8086}: linkerd_app_core::control: Failed to resolve control-plane component error=end of stream reached
[    15.530595s]  WARN ThreadId(01) inbound:accept{client.addr=10.112.84.113:36198 target.addr=10.112.57.233:8086}: linkerd_app_core::control: Failed to resolve control-plane component error=end of stream reached
[   127.078861s]  WARN ThreadId(01) inbound:accept{client.addr=10.112.66.72:55734 target.addr=10.112.57.233:8086}: linkerd_app_core::control: Failed to resolve control-plane component error=end of stream reached
[   214.630995s]  WARN ThreadId(01) inbound:accept{client.addr=10.112.79.192:57094 target.addr=10.112.57.233:8086}: linkerd_app_core::control: Failed to resolve control-plane component error=end of stream reached
[   242.964905s]  WARN ThreadId(01) inbound:accept{client.addr=10.112.64.89:41700 target.addr=10.112.57.233:8086}: linkerd_app_core::control: Failed to resolve control-plane component error=end of stream reached
[   246.823316s]  WARN ThreadId(01) inbound:accept{client.addr=10.112.90.217:41114 target.addr=10.112.57.233:8086}: linkerd_app_core::control: Failed to resolve control-plane component error=end of stream reached

@kleimkuhler
Contributor

@pepsi1k the Failed to resolve control-plane component error=end of stream reached errors have been fixed; the fix is part of recent edge releases such as edge-21.4.1 and will be included in stable-2.10.1.

The Failed to resolve control-plane component errors are not significant and resolve once all the control plane components start up.

If you continue to run into issues, please open a new issue so that we can start from a fresh description of the issue. Thanks!

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 16, 2021