
Almost every app gets UC errors, 0.012% of all requests in 24h period #13848

Closed
Stono opened this issue May 6, 2019 · 94 comments

Comments

@Stono
Contributor

Stono commented May 6, 2019

I've noticed pretty much every application periodically gets this, a 503 with the envoy code UC.

[Screenshot: 2019-05-06 at 19:06:32]

It's retried, as you can see, but it's bothering me that I cannot get to the bottom of why it's happening. My DestinationRule should deal with any connection timeouts etc.:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: sauron-graphql-server
spec:
  host: app
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1024
        http2MaxRequests: 10240
        idleTimeout: 5m
        maxRequestsPerConnection: 1024
      tcp:
        maxConnections: 10240
        tcpKeepalive:
          interval: 1m
          time: 3m
    loadBalancer:
      simple: LEAST_CONN
    outlierDetection:
      baseEjectionTime: 5s
      consecutiveErrors: 5
      interval: 5s
      maxEjectionPercent: 50
      minHealthPercent: 50
    tls:
      mode: ISTIO_MUTUAL
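
For reference, one way to confirm the trafficPolicy above has actually been pushed to a client sidecar is to look at the outbound cluster it generates in the sidecar's config dump (a rough sketch; pod name, namespace and port 80 are placeholders):

kubectl -n <client-namespace> exec <client-pod> -c istio-proxy -- \
  curl -s localhost:15000/config_dump \
  | grep -A 40 'outbound|80||app.<namespace>.svc.cluster.local'

You would expect the keepalive, idle timeout and outlier detection settings to appear on that cluster if the rule has been applied.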

The frequency of these bothers me, because across a distributed service graph, we can get spikes in latency on client requests as a result.

@howardjohn
Member

Do you know if this is a new issue you are experiencing?

@Stono
Contributor Author

Stono commented May 6, 2019

Annoyingly, I'm not sure :-(

Historically I've only really been looking at reporter=source, which doesn't show them (presumably due to retries). However, while investigating some latency blips recently, I've started looking at reporter=destination, which is where I saw these.

[Screenshot: 2019-05-06 at 18:38:56]

@Stono
Contributor Author

Stono commented May 6, 2019

Hmm, I have a theory that my updated timeout settings won't apply to existing connections, so the continued 503 UCs I'm still seeing are old connections timing out. I'll restart a bunch of the apps and leave them overnight to observe.

@Stono
Contributor Author

Stono commented May 6, 2019

Well, 10 minutes after restarting them I saw another :'( so I guess I didn't need to wait overnight!

[Screenshot: 2019-05-06 at 20:00:37]

@arianmotamedi
Contributor

Would you be able to post Envoy router and connection pool logs? I'm wondering if it's a case of a closed-by-upstream connection being reused that results in a 503 (recently fixed in Envoy but hasn't been released yet). Also, can you try setting idleTimeout to a much smaller value like 15s to see if the number of 503s changes at all?
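
If it helps, one relatively low-noise way to capture those (a sketch, assuming the sidecar's admin API on port 15000 and placeholder pod/namespace names) is to bump only the relevant Envoy loggers rather than turning the whole proxy to debug:

kubectl -n <namespace> exec <dst-pod> -c istio-proxy -- curl -s -X POST 'localhost:15000/logging?router=debug'
kubectl -n <namespace> exec <dst-pod> -c istio-proxy -- curl -s -X POST 'localhost:15000/logging?connection=debug'
kubectl -n <namespace> exec <dst-pod> -c istio-proxy -- curl -s -X POST 'localhost:15000/logging?pool=debug'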

@Stono
Contributor Author

Stono commented May 6, 2019

The router logs in debug are difficult to capture as they're very verbose and the failure rate is quite low, say 0.02%. It becomes difficult to capture logs from all instances for a given app and catch the UC, as I don't ship istio-proxy logs off to external logging (ELK).

I'll try and think of something tomorrow.

In the interim I'll try adjusting the idleTimeout and see what happens, but I'm unsure how it would help, considering I've configured the keepalive to be less than the timeout.

@Stono
Contributor Author

Stono commented May 6, 2019

15s made no difference 🤷

@Stono
Contributor Author

Stono commented May 6, 2019

OK @arianmotamedi @howardjohn I persevered and managed to catch one with router and connection logs.

This is from the destination envoy.

[2019-05-06 20:22:57.530][30][debug][router] [external/envoy/source/common/router/router.cc:320] [C17068][S9430344543646919123] cluster 'inbound|80|http-web|app.sauron-seo-app.svc.cluster.local' match for URL '/cars/peugeot/partner/combi'
[2019-05-06 20:22:57.530][30][debug][router] [external/envoy/source/common/router/router.cc:381] [C17068][S9430344543646919123] router decoding headers:
':authority', 'app.sauron-seo-app'
':path', '/cars/peugeot/partner/combi'
':method', 'GET'
... etc etc etc

[2019-05-06 20:22:57.530][30][debug][router] [external/envoy/source/common/router/router.cc:1165] [C17068][S9430344543646919123] pool ready
[2019-05-06 20:22:57.531][28][debug][router] [external/envoy/source/common/router/router.cc:717] [C19043][S5917390943479852622] upstream headers complete: end_stream=false
[2019-05-06 20:22:57.532][28][debug][connection] [external/envoy/source/common/network/connection_impl.cc:101] [C19043] closing data_to_write=1684 type=2
[2019-05-06 20:22:57.532][28][debug][connection] [external/envoy/source/common/network/connection_impl.cc:153] [C19043] setting delayed close timer with timeout 1000 ms
[2019-05-06 20:22:57.536][28][debug][router] [external/envoy/source/common/router/router.cc:320] [C0][S13455846817491226201] cluster 'outbound|15004||istio-telemetry.istio-system.svc.cluster.local' match for URL '/istio.mixer.v1.Mixer/Report'
[2019-05-06 20:22:57.536][28][debug][router] [external/envoy/source/common/router/router.cc:381] [C0][S13455846817491226201] router decoding headers:
':method', 'POST'
':path', '/istio.mixer.v1.Mixer/Report'
...etc etc etc

[2019-05-06 20:22:57.536][28][debug][router] [external/envoy/source/common/router/router.cc:1165] [C0][S13455846817491226201] pool ready
[2019-05-06 20:22:57.540][28][debug][router] [external/envoy/source/common/router/router.cc:717] [C0][S13455846817491226201] upstream headers complete: end_stream=false
[2019-05-06 20:22:57.559][30][debug][connection] [external/envoy/source/common/network/connection_impl.cc:502] [C19020] remote close
[2019-05-06 20:22:57.559][30][debug][connection] [external/envoy/source/common/network/connection_impl.cc:183] [C19020] closing socket: 0
[2019-05-06 20:22:57.559][30][debug][router] [external/envoy/source/common/router/router.cc:644] [C17068][S9430344543646919123] upstream reset: reset reason connection termination
[2019-05-06 20:22:57.559][30][debug][connection] [external/envoy/source/common/network/connection_impl.cc:461] [C19040] remote early close
[2019-05-06 20:22:57.559][30][debug][connection] [external/envoy/source/common/network/connection_impl.cc:183] [C19040] closing socket: 0
[2019-05-06 20:22:57.560][30][debug][connection] [external/envoy/source/common/network/connection_impl.cc:502] [C17068] remote close
[2019-05-06 20:22:57.560][30][debug][connection] [external/envoy/source/common/network/connection_impl.cc:183] [C17068] closing socket: 0
[2019-05-06 20:22:57.560][30][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:244] [C17068] SSL shutdown: rc=1
[2019-05-06 20:22:57.560][30][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:175] [C17068]
[2019-05-06 20:22:57.581][30][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C19044] handshake error: 2
[2019-05-06 20:22:57.581][30][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C19044] handshake error: 2
[2019-05-06 20:22:57.583][30][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:131] [C19044] handshake complete

@Stono
Contributor Author

Stono commented May 6, 2019

I feel there is overlap with #13616, even though @jaygorrell is specifically talking about his observations during deployments.

@Stono
Contributor Author

Stono commented May 6, 2019

I also want to call out the messages I put on Slack the other day: we have multiple apps that do connection pooling having their connections terminated for no reason, all with the same error. I can't help but feel this is all related.

searchone-consumer talking to search-one (in cluster -> in cluster)
2019-05-02 17:01:59.924 WARN 1 --- [ice-executor-18] u.c.a.s.c.k.service.SearchOneService : Error talking to S1: Connection has been shut down - retrying

sauron-web talking to abtest-allocator (in cluster -> in cluster)
sauron.http.endpoint.HttpEndpointException: java.io.InterruptedIOException: Connection has been shut down

sauron-web talking to ods (in cluster -> on prem)
2019-05-02 13:29:56,204 ERROR u.c.a.o.ODSClientImpl:72 - Logging to ODS failed: Connection has been shut down for message:

@arianmotamedi
Contributor

Wait, in the logs you posted, C17068 is the one being reset, but I don't see a log saying the connection is being reused. If a reset is happening on a new connection, then idle/keepalive timeouts won't have anything to do with the failure. Besides that, these DestinationRule timeouts are applied to requests coming into your service's sidecar; I don't believe the sidecar ingress -> local service hop sees these (you can verify this by checking the cluster configuration for inbound|80|http-web|app.sauron-seo-app.svc.cluster.local).
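
A sketch of that check, with a placeholder pod name (the same config_dump approach works for any of the clusters):

kubectl -n sauron-seo-app exec <dst-pod> -c istio-proxy -- \
  curl -s localhost:15000/config_dump \
  | grep -A 30 'inbound|80|http-web|app.sauron-seo-app.svc.cluster.local'

If the DestinationRule's idleTimeout/tcpKeepalive had been applied to that inbound cluster, you would expect to see them in that section of the dump.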

@howardjohn
Member

We do apply the outlier detection rules to inbound; not sure about the rest. I don't think this actually does anything though, since we have the minHealthPercent.

@Stono
Contributor Author

Stono commented May 6, 2019

TBH I don't know what the problem is. I'm just observing the failures with the code UC and would like the Istio team's help getting to the bottom of it. Other than these logs, I don't know what else to look at.

Here are sections from the config dump for that cluster:

{
  "version_info": "2019-05-06T20:31:35Z/62",
  "cluster": {
    "name": "inbound|80|http-web|app.sauron-seo-app.svc.cluster.local",
    "type": "STATIC",
    "connect_timeout": "10s",
    "circuit_breakers": {
      "thresholds": [
        {}
      ]
    },
    "metadata": {
      "filter_metadata": {
        "istio": {
          "config": "/apis/networking.istio.io/v1alpha3/namespaces/istio-system/destination-rule/default"
        }
      }
    },
    "load_assignment": {
      "cluster_name": "inbound|80|http-web|app.sauron-seo-app.svc.cluster.local",
      "endpoints": [
        {
          "lb_endpoints": [
            {
              "endpoint": {
                "address": {
                  "socket_address": {
                    "address": "127.0.0.1",
                    "port_value": 8080
                  }
                }
              }
            }
          ]
        }
      ]
    }
  },
  "last_updated": "2019-05-06T20:31:54.631Z"
}
"name": "inbound|80|http-web|app.sauron-seo-app.svc.cluster.local",
"validate_clusters": false,
"virtual_hosts": [
{
"name": "inbound|http|80",
"routes": [
{
"decorator": {
"operation": "app.sauron-seo-app.svc.cluster.local:80/*"
},
"route": {
"cluster": "inbound|80|http-web|app.sauron-seo-app.svc.cluster.local",
"timeout": "0s",
"max_grpc_timeout": "0s"
},
"match": {
"prefix": "/"
},
"per_filter_config": {
"mixer": {
"mixer_attributes": {
"attributes": {
"destination.service.namespace": {
"string_value": "sauron-seo-app"
},
"destination.service.name": {
"string_value": "app"
},
"destination.service.host": {
"string_value": "app.sauron-seo-app.svc.cluster.local"
},
"destination.service.uid": {
"string_value": "istio://sauron-seo-app/services/app"
}
}
},
"disable_check_calls": true
}
}
}
],
"domains": [
"*"
]
}
]

{
  "route_config": {
    "name": "inbound|80|http-web|app.sauron-seo-app.svc.cluster.local",
    "virtual_hosts": [
      {
        "name": "inbound|http|80",
        "domains": [
          "*"
        ],
        "routes": [
          {
            "match": {
              "prefix": "/"
            },
            "route": {
              "cluster": "inbound|80|http-web|app.sauron-seo-app.svc.cluster.local",
              "timeout": "0s",
              "max_grpc_timeout": "0s"
            },
            "decorator": {
              "operation": "app.sauron-seo-app.svc.cluster.local:80/*"
            },
            "per_filter_config": {
              "mixer": {
                "disable_check_calls": true,
                "mixer_attributes": {
                  "attributes": {
                    "destination.service.name": {
                      "string_value": "app"
                    },
                    "destination.service.namespace": {
                      "string_value": "sauron-seo-app"
                    },
                    "destination.service.uid": {
                      "string_value": "istio://sauron-seo-app/services/app"
                    },
                    "destination.service.host": {
                      "string_value": "app.sauron-seo-app.svc.cluster.local"
                    }
                  }
                }
              }
            }
          }
        ]
      }
    ],
    "validate_clusters": false
  },
  "last_updated": "2019-05-06T20:31:54.657Z"
}

@Stono
Contributor Author

Stono commented May 6, 2019

I mean, these logs don't sound particularly healthy?

[2019-05-06 20:22:57.559][30][debug][router] [external/envoy/source/common/router/router.cc:644] [C17068][S9430344543646919123] upstream reset: reset reason connection termination
[2019-05-06 20:22:57.559][30][debug][connection] [external/envoy/source/common/network/connection_impl.cc:502] [C17068] remote close
[2019-05-06 20:22:57.560][30][debug][connection] [external/envoy/source/common/network/connection_impl.cc:183] [C17068] closing socket: 0
[2019-05-06 20:22:57.560][30][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:244] [C17068] SSL shutdown: rc=1
[2019-05-06 20:22:57.560][30][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:175] [C17068]

@arianmotamedi
Contributor

I mean, these logs don't sound particularly healthy?

They don't, and it's definitely an issue, but it's not the same issue others reported, which was caused by an existing connection being closed by upstream without the downstream Envoy acknowledging it and returning it to the connection pool (which is what idleTimeout was added for). In those cases, you'd see an explicit log saying the connection is being reused:

[debug][pool] external/envoy/source/common/http/http1/conn_pool.cc:89] [C502] using existing connection

If this isn't in your logs for C17068, then it's not something that idle_timeout/tcp keep_alive settings can fix, since those settings don't have any effect on new connections (and, as you can see, they're not applied to the ingress cluster config anyway). According to these logs, this 503 happened from the destination Envoy to your service (not source Envoy -> destination Envoy). Are there any logs at all from your sauron-seo-app service around the same time?
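
For example, something like this against the destination sidecar's logs (placeholder pod name) should show whether a "using existing connection" ever preceded the reset for that connection ID:

kubectl -n sauron-seo-app logs <dst-pod> -c istio-proxy | grep -E 'C17068|using existing connection'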

@Stono
Contributor Author

Stono commented May 7, 2019

Ah right so it's separate to the other issues then.

@arianmotamedi those logs above are from sauron-seo-app. In the example above it is consumer-gateway -> sauron-seo-app.

@arianmotamedi
Contributor

@Stono these are sauron-seo-app's sidecar logs; I meant sauron-seo-app's own logs, to see if there's any indication that your app forcibly closed a connection, since "upstream" in sauron-seo-app's sidecar refers to the app itself.

The other thing you should check is cluster statistics for inbound|80|http-web|app.sauron-seo-app.svc.cluster.local (hit the /stats endpoint and search for cluster.inbound|80|http-web|app.sauron-seo-app.svc.cluster.local.upstream_) to see if anything other than cx_destroy_remote and cx_destroy_remote_with_active_rq stands out.
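
A sketch of that stats check, with a placeholder pod name:

kubectl -n sauron-seo-app exec <dst-pod> -c istio-proxy -- \
  curl -s localhost:15000/stats \
  | grep 'cluster.inbound|80|http-web|app.sauron-seo-app.svc.cluster.local.upstream_'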

@Stono
Contributor Author

Stono commented May 7, 2019

There is nothing in the application logs about it killing connections, no.

Does the envoy -> app hop use connection pooling?
If so, is that pool sending keep-alives to the app?

Here are all the cx stats for one of the instances that's just been spawned (2 mins ago) and has had a UC already (just one):

cluster.xds-grpc.upstream_cx_close_notify: 0
cluster.xds-grpc.upstream_cx_connect_attempts_exceeded: 0
cluster.xds-grpc.upstream_cx_connect_fail: 0
cluster.xds-grpc.upstream_cx_connect_timeout: 0
cluster.xds-grpc.upstream_cx_destroy: 0
cluster.xds-grpc.upstream_cx_destroy_local: 0
cluster.xds-grpc.upstream_cx_destroy_local_with_active_rq: 0
cluster.xds-grpc.upstream_cx_destroy_remote: 0
cluster.xds-grpc.upstream_cx_destroy_remote_with_active_rq: 0
cluster.xds-grpc.upstream_cx_destroy_with_active_rq: 0
cluster.xds-grpc.upstream_cx_http1_total: 0
cluster.xds-grpc.upstream_cx_http2_total: 1
cluster.xds-grpc.upstream_cx_idle_timeout: 0
cluster.xds-grpc.upstream_cx_max_requests: 0
cluster.xds-grpc.upstream_cx_none_healthy: 1
cluster.xds-grpc.upstream_cx_overflow: 0
cluster.xds-grpc.upstream_cx_protocol_error: 0
cluster.xds-grpc.upstream_cx_rx_bytes_buffered: 134
cluster.xds-grpc.upstream_cx_rx_bytes_total: 69136
cluster.xds-grpc.upstream_cx_total: 1
cluster.xds-grpc.upstream_cx_tx_bytes_buffered: 0
cluster.xds-grpc.upstream_cx_tx_bytes_total: 34061
cluster.xds-grpc.upstream_cx_connect_ms: P0(nan,8) P25(nan,8.025) P50(nan,8.05) P75(nan,8.075) P90(nan,8.09) P95(nan,8.095) P99(nan,8.099) P99.5(nan,8.0995) P99.9(nan,8.0999) P100(nan,8.1)
cluster.xds-grpc.upstream_cx_length_ms: No recorded values

@Stono
Contributor Author

Stono commented May 7, 2019

And here are the rest of the stats that are non-zero (you can see the 503):

cluster.xds-grpc.internal.upstream_rq_200: 1
cluster.xds-grpc.internal.upstream_rq_2xx: 1
cluster.xds-grpc.internal.upstream_rq_503: 1
cluster.xds-grpc.internal.upstream_rq_5xx: 1
cluster.xds-grpc.internal.upstream_rq_completed: 2
cluster.xds-grpc.lb_healthy_panic: 1
cluster.xds-grpc.max_host_weight: 1
cluster.xds-grpc.membership_change: 1
cluster.xds-grpc.membership_healthy: 1
cluster.xds-grpc.membership_total: 1
cluster.xds-grpc.ssl.ciphers.ECDHE-RSA-AES128-GCM-SHA256: 1
cluster.xds-grpc.ssl.curves.X25519: 1
cluster.xds-grpc.ssl.handshake: 1
cluster.xds-grpc.ssl.sigalgs.rsa_pss_rsae_sha256: 1
cluster.xds-grpc.ssl.versions.TLSv1.2: 1
cluster.xds-grpc.update_attempt: 96
cluster.xds-grpc.update_no_rebuild: 95
cluster.xds-grpc.update_success: 96
cluster.xds-grpc.upstream_cx_active: 1
cluster.xds-grpc.upstream_cx_http2_total: 1
cluster.xds-grpc.upstream_cx_none_healthy: 1
cluster.xds-grpc.upstream_cx_rx_bytes_buffered: 134
cluster.xds-grpc.upstream_cx_rx_bytes_total: 69270
cluster.xds-grpc.upstream_cx_total: 1
cluster.xds-grpc.upstream_cx_tx_bytes_total: 36311
cluster.xds-grpc.upstream_rq_200: 1
cluster.xds-grpc.upstream_rq_2xx: 1
cluster.xds-grpc.upstream_rq_503: 1
cluster.xds-grpc.upstream_rq_5xx: 1
cluster.xds-grpc.upstream_rq_active: 1
cluster.xds-grpc.upstream_rq_completed: 2
cluster.xds-grpc.upstream_rq_pending_total: 1
cluster.xds-grpc.upstream_rq_total: 1
cluster_manager.active_clusters: 25
cluster_manager.cds.update_attempt: 2
cluster_manager.cds.update_success: 1
cluster_manager.cds.version: 10570931166443169755
cluster_manager.cluster_added: 25
cluster_manager.cluster_updated: 5
http_mixer_filter.total_remote_report_calls: 808
http_mixer_filter.total_remote_report_successes: 808
http_mixer_filter.total_report_calls: 4272
listener_manager.lds.update_attempt: 2
listener_manager.lds.update_success: 1
listener_manager.lds.version: 10570931166443169755
listener_manager.listener_added: 17
listener_manager.listener_create_success: 34
listener_manager.total_listeners_active: 17
server.concurrency: 2
server.days_until_first_cert_expiring: 25
server.live: 1
server.memory_allocated: 9784536
server.memory_heap_size: 19922944
server.total_connections: 20
server.uptime: 475
server.version: 7600795

@Stono
Contributor Author

Stono commented May 8, 2019

OK this issue can be exacerbated when you have more pods:

Take this:

❯ k get pods
NAME                               READY   STATUS    RESTARTS   AGE
redis-commander-7bbbf67df7-mwsms   1/1     Running   0          6d12h
sauron-seo-app-55b7bd4c45-8vkmp    2/2     Running   0          39m
sauron-seo-app-55b7bd4c45-r5dk5    2/2     Running   0          39m
sauron-seo-app-55b7bd4c45-sbj9m    2/2     Running   0          39m

And then look at the first two requests failing:
[Screenshot: 2019-05-08 at 20:28:20]

This app is getting a very small amount of traffic, but enough to prevent timeouts:
[Screenshot: 2019-05-08 at 20:30:46]

There's absolutely no pattern to the UCs, so it isn't some sort of fixed-interval drop.

We've removed:

    loadBalancer:
      simple: LEAST_CONN
    outlierDetection:
      baseEjectionTime: 5s
      consecutiveErrors: 5
      interval: 5s
      maxEjectionPercent: 50
      minHealthPercent: 50

Just in case that was interfering for whatever reason - but we see the same behaviour.

I'm really not sure what else @emedina and I can do to help debug the problem!

@douglas-reid
Contributor

assigning to @duderino for triage, etc.

Stono changed the title from "Almost every app gets UC errors" to "Almost every app gets UC errors, 0.012% of all requests in 24h period" on May 8, 2019
@duderino
Contributor

duderino commented May 8, 2019

a closed-by-upstream connection being reused that results in a 503 (recently fixed in Envoy but hasn't been released yet).

@arianmotamedi Can you point me to this PR? We could potentially cherry-pick it.

@Stono
Contributor Author

Stono commented May 8, 2019

@arianmotamedi that is a problem, but I don't believe it is the cause of this issue.

@duderino
Contributor

duderino commented May 8, 2019

@Stono we certainly have not proven it's the cause, but I'd like to keep considering it.

@PiotrSikora @silentdai @jplevyak your thoughts please

@Stono
Contributor Author

Stono commented May 20, 2019

TBH if you had maxRequestsPerConnection not set (so basically permanent connection pooling) and nf_conntrack_generic_timeout was the issue, we'd see UCs every 600 seconds, but we don't :-)

I've just unset maxRequestsPerConnection and set idleTimeout to 30s on some of our services; I'll leave them overnight and see what happens.
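
For reference, the change is roughly this (same DestinationRule as at the top of the issue, only the http block shown):

    connectionPool:
      http:
        http1MaxPendingRequests: 1024
        http2MaxRequests: 10240
        idleTimeout: 30s
        # maxRequestsPerConnection deliberately left unset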

@lambdai
Contributor

lambdai commented May 20, 2019

Could you save the source envoy and dst envoy config somewhere?

kubectl -n <SRC_POD_NAMESPACE> exec -it  <SRC_POD_NAME> -c istio-proxy -- curl localhost:15000/config_dump 

and the dst pod.

My concern is that Pilot might interpret an infinite timeout as a hard-coded number instead of a true infinite maximum. I see this pattern a lot.

@Stono
Contributor Author

Stono commented May 20, 2019

Idle timeout definitely works btw; you can see here where I set it to a low value (30s) and as a result get more connection churn into the destination app:

[Screenshot: 2019-05-20 at 21:58:35]

@Stono
Contributor Author

Stono commented May 20, 2019

I've sent you the config dumps on Slack.

@lambdai
Contributor

lambdai commented May 20, 2019

That doesn't look good. IIUC the active request count should remain the same while the connection rate is higher.

@Stono
Contributor Author

Stono commented May 20, 2019

Yes, that's what you can see there; the connection rate goes up while active requests stay around the same (15s sample rate).

@lambdai
Contributor

lambdai commented May 22, 2019

Update: unfortunately I didn't manage to reproduce it, but I have some clues from my experiment.
What we should expect with the above configs:

  • Established connection between client envoy and server envoy.
    Between the client sidecar envoy and the server envoy there is at least one established connection, initially created when the client app makes its first request.
  • When is the connection closed?
    It can have multiple causes, and the consequences vary a lot; I cannot enumerate them all.
  • idleTimeout: the client envoy closes the connection. As the connection is idle, no request should be on the wire when it does.
  • Force close by the server envoy. If there is no request on the wire, the next request from the client should not see a 503. Of course it is a 503 if there is data on the wire. Envoy doesn't use pipelining on HTTP/1.1, so at most one request is affected.
  • Health check to the server envoy fails. I see this in the code; it's a source of UC 503. I would rather prevent a failed health check than dig into the exact behavior after hitting that failure.

@lambdai
Contributor

lambdai commented May 22, 2019

@Stono We should not see "remote early close" unless the service is in a bad state. It indicates that further data was expected when the connection was closed. We shouldn't see this given you describe it as low rps. Is there any evidence that the client (xx-gateway) or the server (seo) is overloaded at the time of the UC 503? It might not be overall overload but congestion on a single request.

@Stono
Contributor Author

Stono commented May 22, 2019

That's a shame. I've had 475 in the last 24 hours (on 112,626,354 requests, across 410 envoys):

[Screenshot: 2019-05-22 at 18:37:03]

It's nothing to do with the application being overloaded (it really isn't); in fact, two of our top offenders get very few requests per second (6). I thought we had also demonstrated that the issue is between the source envoy and the destination envoy, not between the destination envoy and the destination app?

I don't know if it helps debug at all, but here are some envoy graphs for the two top offenders:

sauron-seo-app:
[Screenshot: 2019-05-22 at 18:39:57]

sauron-graphql-server:
[Screenshot: 2019-05-22 at 18:40:12]

And another app which consistently gets quite a bit more traffic but has had no UCs (search-one):
[Screenshot: 2019-05-22 at 18:40:54]

They're all running with the same DestinationRule configuration:

  spec:
    host: app
    trafficPolicy:
      connectionPool:
        http:
          http1MaxPendingRequests: 1024
          http2MaxRequests: 10240
          idleTimeout: 30m
        tcp:
          maxConnections: 10240
          tcpKeepalive:
            interval: 30s
            time: 30s

@Stono
Contributor Author

Stono commented May 22, 2019

I think the key takeaway from this issue is that you're not going to be able to replicate the incredibly diverse setup of each of your customers' environments, but your customers are going to experience issues.

I think the focus should be on providing tooling, debugging capability, better metrics (cc @douglas-reid), improved logging and so on, in order to help diagnose issues.

@douglas-reid
Contributor

@Stono This is a bit hard to digest, but if there are metrics that would help further, I'd love to brainstorm on their generation.

@Stono
Contributor Author

Stono commented May 22, 2019

We see more of these as connections ramp up.

Following a deployment, here are the 503s:
[Screenshot: 2019-05-22 at 20:20:15]

And here are the connections TO the envoy in front of sauron-seo-app:
[Screenshot: 2019-05-22 at 20:20:21]

@Stono
Contributor Author

Stono commented May 22, 2019

And again:
[Screenshot: 2019-05-22 at 20:41:52]
[Screenshot: 2019-05-22 at 20:41:57]

@PiotrSikora
Contributor

PiotrSikora commented May 23, 2019

@Stono could you provide logs from all involved peers (i.e. source app, source Envoy, destination Envoy, destination app) for the same 503 UC event (with a few minutes of logs before and after for context)? I'm trying to figure out which connection is terminated first and starts the chain of errors, but it's unclear from the small pieces of logs from different events, and I couldn't find any event with logs from multiple peers.

You're saying that the issue is between sidecars:

I thought we had also demonstrated that the issue is between source envoy and destination envoy, not between destination envoy and destination app?

But in the logs that you provided, the destination Envoy sees an upstream connection (i.e. connection to the destination app) being reset:

[2019-05-06 20:22:57.559][30][debug][router] [external/envoy/source/common/router/router.cc:644] [C17068][S9430344543646919123] upstream reset: reset reason connection termination

Similarly, the tcpdump that @chrismckean captured shows that the destination app resets the connection to the destination Envoy, doesn't it?

That suggests that the issue is with the connection to the destination app, and not between sidecars.

@Stono
Contributor Author

Stono commented May 23, 2019

@PiotrSikora I'm sorry, but I'm not spending another whole day grabbing logs.
I've provided ridiculous amounts of information to numerous people on this issue, on other issues, via Slack, etc. I really believe there are multiple issues going on!

Please let's just get on this call later and live debug together. You'll be able to ask questions and we can answer them in real time.

The really frustrating thing for me is I'm not able to replicate this in a testing environment, only on a busy cluster, which I can show you later.

@AWKIF

AWKIF commented May 29, 2019

Hello,
We have the same issue; I read all the open tickets about this. One thing is that we don't use sidecars. We notice some sporadic 503s (mostly from our external monitoring tool). We had been using Istio 1.0.2 for some time.
I installed 1.1.7 yesterday and we still have this issue.

logs from istio-ingressgateway:

{"log":"[2019-05-29T11:48:56.105Z] \"GET / HTTP/1.1\" 503 UC \"-\" 0 95 288 - \"10.234.76.0\" \"Go-http-client/1.1\" \"0bede68c-a8a3-4d84-a8e9-1b0ea9f21beb\" \"npm.deez.re\" \"10.234.41.169:4873\" outbound|4873||verdaccio.coreinfra.svc.dev.kubernetes - 10.234.23.18:80 10.234.76.0:54740 -\n","stream":"stdout","time":"2019-05-29T11:49:03.638779538Z"}

logs from the app (verdaccio) at the same time:

{"log":" http \u003c-- 200, user: null(10.234.61.0), req: 'GET /', bytes: 0/1042\n","stream":"stdout","time":"2019-05-29T11:48:53.279570381Z"}
{"log":" http \u003c-- 200, user: null(10.234.23.1 via 10.234.23.0), req: 'HEAD /', bytes: 0/0\n","stream":"stdout","time":"2019-05-29T11:48:53.430792718Z"}
{"log":" http \u003c-- 200, user: null(10.234.61.0), req: 'GET /', bytes: 0/1042\n","stream":"stdout","time":"2019-05-29T11:48:54.297110264Z"}
{"log":" http \u003c-- 200, user: null(10.234.23.0 via 10.234.76.0), req: 'HEAD /', bytes: 0/0\n","stream":"stdout","time":"2019-05-29T11:48:54.448685695Z"}
{"log":" http \u003c-- 200, user: null(10.234.61.0), req: 'GET /', bytes: 0/1042\n","stream":"stdout","time":"2019-05-29T11:48:55.31475736Z"}
{"log":" http \u003c-- 200, user: null(10.234.23.0 via 10.234.76.0), req: 'HEAD /', bytes: 0/0\n","stream":"stdout","time":"2019-05-29T11:48:55.466233909Z"}
{"log":" http \u003c-- 200, user: null(10.234.61.0), req: 'GET /', bytes: 0/1042\n","stream":"stdout","time":"2019-05-29T11:48:56.675295282Z"}
{"log":" http \u003c-- 200, user: null(10.234.23.1 via 10.234.23.0), req: 'HEAD /', bytes: 0/0\n","stream":"stdout","time":"2019-05-29T11:48:56.675736777Z"}
{"log":" http \u003c-- 200, user: null(10.234.41.1), req: 'GET /-/ping', bytes: 0/3\n","stream":"stdout","time":"2019-05-29T11:48:57.38159729Z"}
{"log":" http \u003c-- 200, user: null(10.234.61.0), req: 'GET /', bytes: 0/1042\n","stream":"stdout","time":"2019-05-29T11:48:57.692149781Z"}
{"log":" http \u003c-- 200, user: null(10.234.23.0 via 10.234.76.0), req: 'HEAD /', bytes: 0/0\n","stream":"stdout","time":"2019-05-29T11:48:57.692597648Z"}
{"log":" http \u003c-- 200, user: null(10.234.61.0), req: 'GET /', bytes: 0/1042\n","stream":"stdout","time":"2019-05-29T11:48:58.710610856Z"}
{"log":" http \u003c-- 200, user: null(10.234.23.0 via 10.234.76.0), req: 'HEAD /', bytes: 0/0\n","stream":"stdout","time":"2019-05-29T11:48:58.711203625Z"}
{"log":" http \u003c-- 200, user: null(10.234.76.0 via 10.234.23.0), req: 'GET /', bytes: 0/454\n","stream":"stdout","time":"2019-05-29T11:48:58.841457719Z"}

It happens for other services/apps as well when using Istio.
NodePort services (not using the ingress) don't have this issue.
We ran tests all night:
curl on the pod directly => 200 only
curl on the ingress => some 503

Istio was installed with these:
helm template install/kubernetes/helm/istio --name istio --namespace istio-system --set global.proxy.accessLogFile="/dev/stdout" --set global.proxy.clusterDomain=dev.kubernetes --set prometheus.service.nodePort.enabled=true --set sidecarInjectorWebhook.enabled=false --set global.proxy.includeIPRanges=100.64.0.0/16 --set gateways.istio-ingressgateway.type=NodePort --set gateways.istio-ingressgateway.autoscaleMin=2 --set pilot.autoscaleMin=2 --set gateways.istio-ingressgateway.nodeSelector.node-role\\.kubernetes\\.io/master=\"true\" > istio1.1.7.yaml

If you have any questions, I'd like to fix this ASAP.
Thanks

@AWKIF

AWKIF commented May 29, 2019

Some more logs in debug:

"log":"[2019-05-29 15:32:47.807][182][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:1129] **no healthy host for HTTP connection pool**\n","stream":"stderr","time":"2019-05-29T15:32:47.807683175Z"}
{"log":"[2019-05-29 15:32:47.807][182][debug][http] [external/envoy/source/common/http/async_client_impl.cc:94] async http request response headers (end_stream=false):\n","stream":"stderr","time":"2019-05-29T15:32:47.807691662Z"}
{"log":"':status', '503'\n","stream":"stderr","time":"2019-05-29T15:32:47.807717111Z"}
{"log":"'content-length', '19'\n","stream":"stderr","time":"2019-05-29T15:32:47.807726288Z"}
{"log":"'content-type', 'text/plain'\n","stream":"stderr","time":"2019-05-29T15:32:47.807732686Z"}
{"log":"\n","stream":"stderr","time":"2019-05-29T15:32:47.807738813Z"}
{"log":"[2019-05-29 15:32:47.807][182][trace][http] [external/envoy/source/common/http/async_client_impl.cc:102] async http request response data (length=19 end_stream=true)\n","stream":"stderr","time":"2019-05-29T15:32:47.80774496Z"}
{"log":"[2019-05-29 15:32:47.854][187][debug][main] [external/envoy/source/server/connection_handler_impl.cc:257] [C941004] new connection\n","stream":"stderr","time":"2019-05-29T15:32:47.854635313Z"}
{"log":"[2019-05-29 15:32:47.854][187][trace][connection] [external/envoy/source/common/network/connection_impl.cc:440] [C941004] socket event: 3\n","stream":"stderr","time":"2019-05-29T15:32:47.854659785Z"}
{"log":"[2019-05-29 15:32:47.854][187][trace][connection] [external/envoy/source/common/network/connection_impl.cc:508] [C941004] write ready\n","stream":"stderr","time":"2019-05-29T15:32:47.854667128Z"}
{"log":"[2019-05-29 15:32:47.854][187][trace][connection] [external/envoy/source/common/network/connection_impl.cc:478] [C941004] read ready\n","stream":"stderr","time":"2019-05-29T15:32:47.854673563Z"}
{"log":"[2019-05-29 15:32:47.854][187][trace][connection] [external/envoy/source/common/network/raw_buffer_socket.cc:23] [C941004] read returns: 177\n","stream":"stderr","time":"2019-05-29T15:32:47.854679934Z"}
{"log":"[2019-05-29 15:32:47.854][187][trace][connection] [external/envoy/source/common/network/raw_buffer_socket.cc:37] **[C941004] read error: Resource temporarily unavailable**\n","stream":"stderr","time":"2019-05-29T15:32:47.854686252Z"}
{"log":"[2019-05-29 15:32:47.854][187][trace][http] [external/envoy/source/common/http/http1/codec_impl.cc:363] [C941004] parsing 177 bytes\n","stream":"stderr","time":"2019-05-29T15:32:47.854692705Z"}
{"log":"[2019-05-29 15:32:47.854][187][trace][http] [external/envoy/source/common/http/http1/codec_impl.cc:476] [C941004] message begin\n","stream":"stderr","time":"2019-05-29T15:32:47.854701344Z"}
{"log":"[2019-05-29 15:32:47.854][187][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:243] [C941004] new stream\n","stream":"stderr","time":"2019-05-29T15:32:47.854744225Z"}
{"log":"[2019-05-29 15:32:47.854][187][trace][http] [external/envoy/source/common/http/http1/codec_impl.cc:331] [C941004] completed header: key=Host value=rundeck-cac.deez.re\n","stream":"stderr","time":"2019-05-29T15:32:47.854769334Z"}
{"log":"[2019-05-29 15:32:47.854][187][trace][http] [external/envoy/source/common/http/http1/codec_impl.cc:331] [C941004] completed header: key=User-Agent value=Go-http-client/1.1\n","stream":"stderr","time":"2019-05-29T15:32:47.854777073
Z"}
{"log":"[2019-05-29 15:32:47.854][187][trace][http] [external/envoy/source/common/http/http1/codec_impl.cc:331] [C941004] completed header: key=Referer value=http://rundeck-cac.deez.re/user/login\n","stream":"stderr","time":"2019-05-29T15
:32:47.854783338Z"}
{"log":"[2019-05-29 15:32:47.854][187][trace][http] [external/envoy/source/common/http/http1/codec_impl.cc:331] [C941004] completed header: key=Accept-Encoding value=gzip\n","stream":"stderr","time":"2019-05-29T15:32:47.854789387Z"}
{"log":"[2019-05-29 15:32:47.854][187][trace][http] [external/envoy/source/common/http/http1/codec_impl.cc:442] [C941004] headers complete\n","stream":"stderr","time":"2019-05-29T15:32:47.854795349Z"}
{"log":"[2019-05-29 15:32:47.854][187][trace][http] [external/envoy/source/common/http/http1/codec_impl.cc:331] [C941004] completed header: key=Connection value=close\n","stream":"stderr","time":"2019-05-29T15:32:47.854801091Z"}
{"log":"[2019-05-29 15:32:47.854][187][trace][http] [external/envoy/source/common/http/http1/codec_impl.cc:463] [C941004] message complete\n","stream":"stderr","time":"2019-05-29T15:32:47.854806925Z"}
{"log":"[2019-05-29 15:32:47.854][187][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:580] [C941004][S14732545832523136882] request headers complete (end_stream=true):\n","stream":"stderr","time":"2019-05-29T15:32:47
.854824358Z"}
{"log":"':authority', 'rundeck-cac.deez.re'\n","stream":"stderr","time":"2019-05-29T15:32:47.854834945Z"}
{"log":"':path', '/user/login'\n","stream":"stderr","time":"2019-05-29T15:32:47.854841069Z"}
{"log":"':method', 'GET'\n","stream":"stderr","time":"2019-05-29T15:32:47.85484644Z"}
{"log":"'user-agent', 'Go-http-client/1.1'\n","stream":"stderr","time":"2019-05-29T15:32:47.854851924Z"}
{"log":"'referer', 'http://rundeck-cac.deez.re/user/login'\n","stream":"stderr","time":"2019-05-29T15:32:47.854857273Z"}
{"log":"'accept-encoding', 'gzip'\n","stream":"stderr","time":"2019-05-29T15:32:47.854862577Z"}
{"log":"'connection', 'close'\n","stream":"stderr","time":"2019-05-29T15:32:47.854867935Z"}
{"log":"\n","stream":"stderr","time":"2019-05-29T15:32:47.854873194Z"}
{"log":"[2019-05-29 15:32:47.854][187][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1040] [C941004][S14732545832523136882] request end stream\n","stream":"stderr","time":"2019-05-29T15:32:47.854878415Z"}

Also noting this a few lines after:
{"log":"[2019-05-29 15:32:47.855][187][debug][pool] [external/envoy/source/common/http/conn_pool_base.cc:20] queueing request due to no available connections\n","stream":"stderr","time":"2019-05-29T15:32:47.855184357Z"}

@lambdai
Contributor

lambdai commented May 30, 2019

istio/envoy#73 fixes the known issue. The PR is in testing.
tl;dr: envoy could use a half-closed connection when talking to upstream and respond with a 503 to downstream.

@Stono
Contributor Author

Stono commented May 31, 2019

Along with silentdai's PR, we also discovered that applications which have a low socket timeout value will terminate connections to envoy more frequently, exacerbating the issue. I've written about it here: https://karlstoney.com/2019/05/31/istio-503s-ucs-and-tcp-fun-times

The TL;DR here is that @silentdai's update (hopefully in 1.1.8), combined with awareness and adjustment of applications with low TCP socket timeouts, resolves 99.99% of our 503s.
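
If anyone else hits this, a quick way to confirm which side is tearing connections down is to watch the destroy counters on the inbound cluster (a sketch with a placeholder pod name); rising upstream_cx_destroy_remote* counters mean the application, i.e. Envoy's upstream, is the one closing them:

kubectl -n sauron-seo-app exec <dst-pod> -c istio-proxy -- \
  curl -s localhost:15000/stats \
  | grep 'app.sauron-seo-app.svc.cluster.local.upstream_cx_destroy'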

I'm going to close this issue now as it's become a bit of a novel. I'm thankful to the whole Istio team for their efforts helping get to the bottom of this one, and I know there are more improvements to come from them which should make this less of a problem.

@alpaca2333

I have run into both "timeout" and "upstream disconnect/reset" errors, which make up about 2% of all requests. I have no idea if we are hitting the same problem. What's annoying is that the "timeout" requests can be found in the Envoy logs, while the "upstream disconnect" ones cannot.

@turbotankist

Do you have a Node.js application behind Envoy?
Node.js has a 5-second default connection idle timeout, and Envoy has the same one.
On rare occasions this generates the error: Envoy sends a request at the exact moment Node.js is closing the connection.

We resolved this with a DestinationRule:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: qqq-destination-rule
spec:
  host: qqq.aaa.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        idleTimeout: 3s
        maxRetries: 3

@richardscarrott

Thanks for all the info @Stono. You mentioned 1.1.8 will (hopefully) have the fix for the UC 503s; I wondered if you have a link to that PR, as I'm interested in the solution?

The reason I ask is because we originally solved the issue by turning on Envoy retries with x-envoy-max-retries: 10; x-envoy-retry-on: gateway-error,connect-failure,refused-stream, which worked fine.
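
For context, the same policy expressed declaratively in a VirtualService rather than via headers looks roughly like this (a sketch with hypothetical names):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - myapp
  http:
  - route:
    - destination:
        host: myapp
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: gateway-error,connect-failure,refused-stream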

However, we've since had a few incidents in production whereby a third-party service times out and we end up creating massive cascading request chains (our Node.js app also performs retries to third parties, which compounds the issue 🤦‍♂).

We therefore have no choice but to disable Envoy retries, but before we do this I wanted to check whether the 1.1.8 fix solved the underlying issue or if we should expect to see the UC 503s again.

@yan-dblinf

Node.js has 5 sec default connection idleTimeout, and envoy has the same one.

@turbotankist which version of Istio are you using? The default idle timeout in DestinationRule is 1 hour as of 1.4.

@aaronbbrown

Apologies for commenting on this long-closed issue, but I was also experiencing this problem (w/ 1.15). Because some of our traffic comes in via NodePorts directly to the service, there's no outbound rule that I can set for retries, so I ended up adding this EnvoyFilter, which retries upstream reset connections rather than expecting the client to do so:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: passthrough-retries
  namespace: myapp
spec:
  workloadSelector:
    labels:
      app: myapp
  configPatches:
  - applyTo: HTTP_ROUTE
    match:
      context: SIDECAR_INBOUND
      listener:
        portNumber: 8080
        filterChain:
          filter:
            name: "envoy.filters.network.http_connection_manager"
            subFilter:
              name: "envoy.filters.http.router"
    patch:
      operation: MERGE
      value:
        route:
          retry_policy:
            retry_back_off:
              base_interval: 10ms
            retry_on: reset
            num_retries: 2

After applying ☝🏻 to my service, the random 503/UC errors seem to have gone away regardless of the client or ingress method.
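
One way to confirm the patch actually landed on the sidecar (a sketch, assuming istioctl is available; pod and namespace names are placeholders) is to dump the inbound routes and look for the retry policy:

istioctl proxy-config route <myapp-pod>.myapp -o json | grep -i -A 6 retry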

Hopefully this is helpful to someone!
