
503 "upstream connect error or disconnect/reset before headers" in 1.1 with low traffic #13205

Closed
jaygorrell opened this issue Apr 10, 2019 · 35 comments


@jaygorrell
Contributor

jaygorrell commented Apr 10, 2019

Describe the bug
Many services in our environment are experiencing a handful (~1% or so) of 503 errors with the response "upstream connect error or disconnect/reset before headers". I have rolled istio out to two environments and they both experience similar issues.

These environments do not have any Istio gateways -- it's just service to service communication; however, I did replicate the issue using curl through k8s ingress (outside mesh) to a mesh service so I assume it's isolated to the destination's sidecar, which returns the 503.

Digging through some older issues, it appears this is related to a keepalive timeout, which does fit my experience: it's usually my first curl after a long break that fails, if that lines up with anything. In 1.1, there was a default retry added that I believe was supposed to handle this case, but it doesn't appear to be kicking in for me.

Is there anything that needs to be enabled for the default retry to take effect? Policy checks, for example. I also don't have Virtual Services or Destination Rules added yet, but I was told those aren't needed for it either. I did do a quick test by adding one and it didn't help.

Here's a curl example (through k8s ingress) that I tried this morning after being idle all night:

 ~ curl -v https://my-service.stg.example.com/v1/available-inventory/201904091406 https://my-service.stg.example.com/v1/available-inventory/201904091406                                                            5m
*   Trying 10.20.23.98...
* TCP_NODELAY set
* Connected to my-service.stg.example.com (10.20.23.98) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=District of Columbia; L=Washington; O=MyCompany, LLC; OU=Engineering; CN=*.stg.example.com
*  start date: Jan 17 00:00:00 2017 GMT
*  expire date: Mar 25 12:00:00 2020 GMT
*  subjectAltName: host "my-service.stg.example.com" matched cert's "*.stg.example.com"
*  issuer: C=US; O=DigiCert Inc; CN=DigiCert SHA2 Secure Server CA
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x7fc42400d400)
> GET /v1/available-inventory/201904091406 HTTP/2
> Host: my-service.stg.example.com
> User-Agent: curl/7.54.0
> Accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
< HTTP/2 503
< server: nginx/1.15.8
< date: Wed, 10 Apr 2019 13:36:44 GMT
< content-type: text/plain
< content-length: 57
< x-envoy-decorator-operation: my-service.default.svc.cluster.local:80/*
< strict-transport-security: max-age=15724800; includeSubDomains
<
* Connection #0 to host my-service.stg.example.com left intact
upstream connect error or disconnect/reset before headers* Found bundle for host my-service.stg.example.com: 0x7fc423c1d2e0 [can multiplex]
* Re-using existing connection! (#0) with host my-service.stg.example.com
* Connected to my-service.stg.example.com (10.20.23.98) port 443 (#0)
* Using Stream ID: 3 (easy handle 0x7fc42400d400)
> GET /v1/available-inventory/201904091406 HTTP/2
> Host: my-service.stg.example.com
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/2 200
< server: nginx/1.15.8
< date: Wed, 10 Apr 2019 13:36:44 GMT
< content-type: application/json; charset=utf-8
< content-length: 18
< x-powered-by: Express
< etag: W/"12-0TGR6gSl8vAg7WE2Lc4UXbiP8BI"
< x-envoy-upstream-service-time: 4
< x-envoy-decorator-operation: my-service.default.svc.cluster.local:80/*
< strict-transport-security: max-age=15724800; includeSubDomains
<
* Connection #0 to host my-service.stg.example.com left intact
{"inventory":null}

Expected behavior
503s with "upstream connect error or disconnect/reset before headers" are retried by default.

Steps to reproduce the bug
Not completely sure, but some key information:

  • No Virtual Services or Destination Rules
  • 560 pods in 200 services
  • Mostly Node services (which I understand have keepalive-related defaults that have come up in other Istio issues)

Version
Istio 1.1.2
K8S 1.11.8

Installation
Helm template / kubectl apply; nothing of note in the helm value overrides -- just setting a minimum of 2 for istio pods, rewriting http probes, enabling mtls (permissive), and enabling grafana/kiali.

Environment
AWS / Kops

@arianmotamedi
Contributor

Expected behavior
503s with "upstream connect error or disconnect/reset before headers" are retried by default.

Unfortunately this is not true - 503s cannot be automatically retried due to the risky nature of non-idempotent requests. You can define a virtual service and use route-level retries if you're confident multiple identical requests made to the server won't have any adverse effects, but this is not something Envoy would have knowledge of on its own.

Ability to define idle timeout values for upstream connections is currently waiting to be merged in 1.1: #13146. Once it makes it in you can play around with it to see if it helps with the 503s. Also see #9113 for a similar discussion around this issue.
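For reference, a minimal sketch of what route-level retries could look like in a VirtualService (the host, attempt count, and retryOn list below are illustrative, and retryOn needs a release that supports it; only do this if repeated requests are safe for the service):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service
  namespace: default
spec:
  hosts:
  - my-service.default.svc.cluster.local
  http:
  - route:
    - destination:
        host: my-service.default.svc.cluster.local
    retries:
      attempts: 3          # total retry attempts for a failed request
      perTryTimeout: 2s    # per-attempt timeout
      retryOn: connect-failure,refused-stream,retriable-status-codes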

@jaygorrell
Contributor Author

Thanks @arianmotamedi. I've been all over those issues and that's how I was able to get as far as I am now.

The default retry was added for:
connect-failure,refused-stream,unavailable,cancelled,resource-exhausted,retriable-status-codes.

Does that not capture the "upstream connect error or disconnect/reset before headers" issue? I'd be happy to add a custom one, but I thought connect-failure was going to capture this. If not that one, which retryOn is needed?

Also, where exactly is the idle timeout happening? I still have an understanding gap there. Basically, what is the default, and which side is timing out early? I'm guessing the Node services are closing connections, but I'm not sure how long Envoy keeps trying to reuse them anyway.

@arianmotamedi
Contributor

The default retry was added for:
connect-failure,refused-stream,unavailable,cancelled,resource-exhausted,retriable-status-codes

Did you add this yourself? For these to work you'll need to pass the x-envoy-retry-on header to your egress requests: https://www.envoyproxy.io/docs/envoy/latest/configuration/http_filters/router_filter#x-envoy-retry-on.

In most cases, these 503s seem to happen when an upstream service closes a connection but Envoy still thinks the connection is reusable and puts it back in the connection pool. Idle timeout was implemented to address this: you set it in your cluster definition (a DestinationRule in the case of Istio) to a value lower than the one the server typically uses to close connections, so that the connection pool removes the connection before the upstream server closes it. This of course comes with the overhead of having to establish a new connection for future requests whenever a connection expires after being idle for T seconds, which could be especially noticeable for TLS connections; so there's a trade-off here.
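A minimal sketch of what that could look like once the setting is available (the host and the 3s value are placeholders):

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service
  namespace: default
spec:
  host: my-service.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        # Keep this below the upstream server's keep-alive timeout (e.g. Node's 5s default)
        idleTimeout: 3s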

To see what's causing the 503s in your specific case you'll have to run a tcpdump and look at the captured TCP traffic, but based on the symptoms you described (it occurs after being idle for some time) it's likely a keepalive issue and not a connect-failure case.

@jaygorrell
Contributor Author

Hmm, I'm referring to #10566, which was in 1.1.0. Are you saying I'm still missing something required for that to kick in? My understanding is that it applies to all requests by default and handles 503s caused by connection-failure, among others. There's a comment where I asked about it in the bottom of that PR.

@arianmotamedi
Contributor

Oh interesting, I was not aware of those changes making it into 1.1. In any case, there's no indication here that the 503s are happening because of a connect failure vs. an upstream reset. You won't know that for sure unless you do a packet capture. Also check the sidecar proxy's config and see what this retry policy is actually being translated to.

@Stono
Contributor

Stono commented Apr 10, 2019

Considering you're able to consistently repro your 503, I would personally turn on sidecar debug logs on both your source and destination by port-forwarding 15000 and doing curl -XPOST http://127.0.0.1:15000/logging?level=debug, then tailing the logs (-c istio-proxy) for both sidecars, and then make your request.

The logs are super verbose but should give some insight.
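Roughly something like this, assuming the pod name and namespace below are placeholders:

# Port-forward the sidecar's Envoy admin port for the pod you want to inspect
kubectl -n default port-forward pod/<my-service-pod> 15000:15000 &

# Bump the sidecar's log level to debug (set it back with ?level=info when done)
curl -XPOST "http://127.0.0.1:15000/logging?level=debug"

# Tail the sidecar logs while reproducing the request
kubectl -n default logs -f <my-service-pod> -c istio-proxy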

@jaygorrell
Contributor Author

Thanks for the tips. I wouldn't say I can consistently repro it. It's common enough to annoy users but just infrequent enough to be difficult to catch. I'll give it a shot though.

I'm noticing now that no matter how hard I try when the cluster has virtually no traffic at night, it's very difficult to replicate. During the day, though, I could replicate it by hitting a /ping endpoint that just responds with a 200 OK.

Something I'm just now realizing is that the services that have the problem are all services that can take quite long to respond. I'm wondering, after a request is kicked off that takes 40s to respond (and Istio times out at the 15s default), whether there are any scenarios where subsequent requests would hit this issue. It was relatively common during business hours with the /ping endpoint, likely while other users were hitting the longer request endpoints. Just to be clear, these long-running requests are not a problem without going through Istio.

The 15s timeout on longer requests is a problem of its own, but that one is easy enough to solve.
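For what it's worth, a minimal sketch of raising that route timeout via a VirtualService (the host and the 60s value are placeholders):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service
  namespace: default
spec:
  hosts:
  - my-service.default.svc.cluster.local
  http:
  - timeout: 60s   # raise the 15s default route timeout
    route:
    - destination:
        host: my-service.default.svc.cluster.local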

@arianmotamedi
Contributor

A timeout would give you a 504, not a 503, so I highly doubt that it's related. But as Stono suggested, you can turn on debug logs to see if anything stands out.

@jaygorrell
Contributor Author

The 504 would be for the request that actually timed out. I was referring to subsequent requests.

I'll be trying to capture logs from the issue today with traffic ramping back up.

@jaygorrell
Contributor Author

jaygorrell commented Apr 11, 2019

Alright, I was able to capture some 503s with verbose logging on (Thanks @Stono!). I have no idea which pieces are really helpful though so here's a relevant section if it helps.

[2019-04-11 15:58:46.200][23][debug][router] [external/envoy/source/common/router/router.cc:381] [C5][S11598936829995536530] router decoding headers:
':authority', 'my-service.default.svc.cluster.local'
':path', '/v1/roundtrip/outbound'
':method', 'GET'
':scheme', 'http'
'accept', 'application/json, text/plain, */*'
'user-agent', 'axios/0.18.0'
'x-amzn-trace-id', 'Root=1-5caf6436-4e5f73b975cabba30a25f882;Parent=6e6ec6a9cff612fc;Sampled=1'
'x-forwarded-proto', 'http'
'content-length', '0'
'x-b3-traceid', 'f9d3cbc2fa8ab83cff033d4316118097'
'x-b3-spanid', 'fa4c5a176966c797'
'x-b3-parentspanid', 'ff033d4316118097'
'x-b3-sampled', '0'

[2019-04-11 15:58:46.200][23][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:93] [C73] using existing connection
[2019-04-11 15:58:46.200][23][debug][router] [external/envoy/source/common/router/router.cc:1128] [C5][S11598936829995536530] pool ready
[2019-04-11 15:58:46.200][23][debug][connection] [external/envoy/source/common/network/connection_impl.cc:502] [C73] remote close
[2019-04-11 15:58:46.200][23][debug][connection] [external/envoy/source/common/network/connection_impl.cc:183] [C73] closing socket: 0
[2019-04-11 15:58:46.200][23][debug][client] [external/envoy/source/common/http/codec_client.cc:82] [C73] disconnect. resetting 1 pending requests
[2019-04-11 15:58:46.200][23][debug][client] [external/envoy/source/common/http/codec_client.cc:105] [C73] request reset
[2019-04-11 15:58:46.200][23][debug][router] [external/envoy/source/common/router/router.cc:538] [C5][S11598936829995536530] upstream reset
[2019-04-11 15:58:46.200][23][debug][filter] [src/envoy/http/mixer/filter.cc:133] Called Mixer::Filter : encodeHeaders 2
[2019-04-11 15:58:46.200][23][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1235] [C5][S11598936829995536530] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '57'
'content-type', 'text/plain'
'date', 'Thu, 11 Apr 2019 15:58:45 GMT'
'server', 'istio-envoy'

[2019-04-11 15:58:46.200][23][debug][filter] [src/envoy/http/authn/http_filter.cc:42] Called AuthenticationFilter : onDestroy
[2019-04-11 15:58:46.200][23][debug][filter] [src/envoy/http/mixer/filter.cc:205] Called Mixer::Filter : onDestroy state: 2
[2019-04-11 15:58:46.200][23][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:123] [C73] client disconnected
[2019-04-11 15:58:46.200][23][debug][filter] [src/envoy/http/mixer/filter.cc:219] Called Mixer::Filter : log
[2019-04-11 15:58:46.200][23][debug][filter] [./src/envoy/http/mixer/report_data.h:132] No dynamic_metadata found for filter envoy.filters.http.rbac
[2019-04-11 15:58:46.200][23][debug][filter] [src/istio/control/client_context_base.cc:125] Report attributes: attributes {
  key: "connection.mtls"
  value {
    bool_value: false
  }
}
attributes {
  key: "context.protocol"
  value {
    string_value: "http"
  }
}
attributes {
  key: "context.prox
[2019-04-11 15:58:47.203][23][debug][filter] [./src/envoy/utils/header_update.h:46] Mixer forward attributes set: CkoKCnNvdXJjZS51aWQSPBI6a3ViZXJuZXRlczovL2Fpci1zaG9wcGluZy1zZXJ2aWNlLTY0YjY2YmJmNDctYmhyYnguZGVmYXVsdA==
[2019-04-11 15:58:47.203][23][debug][router] [external/envoy/source/common/router/router.cc:320] [C0][S7878944405247577882] cluster 'outbound|15004||istio-telemetry.istio-system.svc.cluster.local' match for URL '/istio.mixer.v1.Mixer/Report'

There is much more but it's very verbose -- the big list of attributes and default_words. I left that out but please let me know if there's something more particular I should be looking for.

I assume this bit is the key part:

[2019-04-11 15:58:46.200][23][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:93] [C73] using existing connection
[2019-04-11 15:58:46.200][23][debug][router] [external/envoy/source/common/router/router.cc:1128] [C5][S11598936829995536530] pool ready
[2019-04-11 15:58:46.200][23][debug][connection] [external/envoy/source/common/network/connection_impl.cc:502] [C73] remote close
[2019-04-11 15:58:46.200][23][debug][connection] [external/envoy/source/common/network/connection_impl.cc:183] [C73] closing socket: 0
[2019-04-11 15:58:46.200][23][debug][client] [external/envoy/source/common/http/codec_client.cc:82] [C73] disconnect. resetting 1 pending requests
[2019-04-11 15:58:46.200][23][debug][client] [external/envoy/source/common/http/codec_client.cc:105] [C73] request reset
[2019-04-11 15:58:46.200][23][debug][router] [external/envoy/source/common/router/router.cc:538] [C5][S11598936829995536530] upstream reset

Does "resetting 1 pending requests" mean the retry did attempt to go through?

@arianmotamedi
Contributor

arianmotamedi commented Apr 11, 2019

There it is:

[2019-04-11 15:58:46.200][23][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:93] [C73] using existing connection
[2019-04-11 15:58:46.200][23][debug][router] [external/envoy/source/common/router/router.cc:1128] [C5][S11598936829995536530] pool ready
[2019-04-11 15:58:46.200][23][debug][connection] [external/envoy/source/common/network/connection_impl.cc:502] [C73] remote close
[2019-04-11 15:58:46.200][23][debug][connection] [external/envoy/source/common/network/connection_impl.cc:183] [C73] closing socket: 0
[2019-04-11 15:58:46.200][23][debug][client] [external/envoy/source/common/http/codec_client.cc:82] [C73] disconnect. resetting 1 pending requests
[2019-04-11 15:58:46.200][23][debug][client] [external/envoy/source/common/http/codec_client.cc:105] [C73] request reset
[2019-04-11 15:58:46.200][23][debug][router] [external/envoy/source/common/router/router.cc:538] [C5][S11598936829995536530] upstream reset

It's using an existing connection from the pool, so not a connect issue. This looks exactly like the case where upstream has closed a connection but Envoy is still keeping it around in the pool. Highly likely it's related to envoyproxy/envoy#6190, which is a known bug. You should be able to use idle_timeout to alleviate most of these 503s. It was merged into the 1.1 branch, so should be available on the next release (1.1.3?)

Does "resetting 1 pending requests" mean the retry did attempt to go through?

No, that's just what happens when there is a local or remote close event: https://github.com/envoyproxy/envoy/blob/cbf03b90e8aa1c3476b493d8629abe7fc82b22c9/source/common/http/codec_client.cc#L79. I don't see anything in the logs indicating that the request was retried - can you dump the sidecar proxy's config? curl http://127.0.0.1:15000/config_dump
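If it helps narrow down a large dump, something like this should pull out just the route retry config (jq assumed available; the exact @type string varies by Envoy version, so treat this as a sketch):

# Dump the sidecar's full Envoy config via the admin endpoint
kubectl -n default exec <my-pod> -c istio-proxy -- \
  curl -s http://127.0.0.1:15000/config_dump > config_dump.json

# Keep only the route configuration and show the retry policies Pilot pushed
jq '.configs[] | select(."@type" | test("RoutesConfigDump"))' config_dump.json \
  | grep -B2 -A8 retry_policy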

@jaygorrell
Contributor Author

jaygorrell commented Apr 11, 2019

I see, thank you!

I guess my next question is whether there is any mitigation I can do on the server side. Does Envoy have a default idle timeout for the pool that I would need the server to exceed? It's Node, which defaults to 5s.

Also I'm still unclear why the default retry in 1.1 didn't mitigate the issue. (edit: just saw your edit - ignore this... getting the dump)

@arianmotamedi
Contributor

arianmotamedi commented Apr 11, 2019

Does envoy have a default idle timeout for the pool that I would need the server to exceed

No, and that's the issue - there is no idle timeout by default. Changes in #13146 will allow you to specify one though, so as soon as it's released you should be able to either set this to < 5s for services that call Node, or, if this is a production environment and performance matters, set it to something like 15s and have your Node service keep the connection open for 30s.
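For the Node side, a minimal sketch of keeping connections open longer than the proxy's idle timeout (Node's http.Server defaults keepAliveTimeout to 5000 ms; the 30s value and port are just examples):

const http = require('http');

const server = http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ ok: true }));
});

// Keep the server-side keep-alive window longer than the proxy's idle timeout
// so the proxy, not the app, decides when idle connections get dropped.
server.keepAliveTimeout = 30000;

server.listen(8080);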

Also I'm still unclear why the default retry in 1.1 didn't mitigate the issue.

This is why I asked if you could get a config dump of the sidecar to see if Pilot is passing the correct route retry configuration to Envoy.

@jaygorrell
Contributor Author

@arianmotamedi Is there a specific section I should look for? It's 35k lines but I'll dump it all if that's useful.

@jaygorrell
Contributor Author

jaygorrell commented Apr 11, 2019

I think I found it:

{
  "name": "my-service.default.svc.cluster.local:80",
  "domains": [
    "my-service.default.svc.cluster.local",
    "my-service.default.svc.cluster.local:80",
    "my-service",
    "my-service:80",
    "my-service.default.svc.cluster",
    "my-service.default.svc.cluster:80",
    "my-service.default.svc",
    "my-service.default.svc:80",
    "my-service.default",
    "my-service.default:80",
    "100.68.80.151",
    "100.68.80.151:80"
  ],
  "routes": [
    {
      "match": {
        "prefix": "/"
      },
      "route": {
        "cluster": "outbound|80||my-service.default.svc.cluster.local",
        "timeout": "60s",
        "retry_policy": {
          "retry_on": "connect-failure,refused-stream,unavailable,cancelled,resource-exhausted,retriable-status-codes",
          "num_retries": 2,
          "retry_host_predicate": [
            {
              "name": "envoy.retry_host_predicates.previous_hosts"
            }
          ],
          "host_selection_retry_max_attempts": "3"
        },
        "max_grpc_timeout": "60s"
      },
      "metadata": {
        "filter_metadata": {
          "istio": {
            "config": "/apis/networking/v1alpha3/namespaces/default/virtual-service/my-service"
          }
        }
      },
      "decorator": {
        "operation": "my-service.default.svc.cluster.local:80/*"
      },
      "per_filter_config": {
        "mixer": {
          "forward_attributes": {
            "attributes": {
              "destination.service.name": {
                "string_value": "my-service"
              },
              "destination.service.namespace": {
                "string_value": "default"
              },
              "destination.service.host": {
                "string_value": "my-service.default.svc.cluster.local"
              },
              "destination.service.uid": {
                "string_value": "istio://default/services/my-service"
              }
            }
          },
          "mixer_attributes": {
            "attributes": {
              "destination.service.namespace": {
                "string_value": "default"
              },
              "destination.service.name": {
                "string_value": "my-service"
              },
              "destination.service.uid": {
                "string_value": "istio://default/services/my-service"
              },
              "destination.service.host": {
                "string_value": "my-service.default.svc.cluster.local"
              }
            }
          },
          "disable_check_calls": true
        }
      }
    }
  ]
}

Though this is an override specifically for the service having the issue, just in case the default policy wasn't working.

edit: Here's the retry policy applied to all other services that I have not added a VirtualService for -- this is what would have applied to this one yesterday before I started trying to debug.

{
  "name": "other-services.default.svc.cluster.local:80",
  "domains": [
    "other-services.default.svc.cluster.local",
    "other-services.default.svc.cluster.local:80",
    "other-services",
    "other-services:80",
    "other-services.default.svc.cluster",
    "other-services.default.svc.cluster:80",
    "other-services.default.svc",
    "other-services.default.svc:80",
    "other-services.default",
    "other-services.default:80",
    "100.69.233.237",
    "100.69.233.237:80"
  ],
  "routes": [
    {
      "match": {
        "prefix": "/"
      },
      "route": {
        "cluster": "outbound|80||other-services.default.svc.cluster.local",
        "timeout": "0s",
        "retry_policy": {
          "retry_on": "connect-failure,refused-stream,unavailable,cancelled,resource-exhausted,retriable-status-codes",
          "num_retries": 2,
          "retry_host_predicate": [
            {
              "name": "envoy.retry_host_predicates.previous_hosts"
            }
          ],
          "host_selection_retry_max_attempts": "3",
          "retriable_status_codes": [
            503
          ]
        },
        "max_grpc_timeout": "0s"
      },
      "decorator": {
        "operation": "other-services.default.svc.cluster.local:80/*"
      },
      "per_filter_config": {
        "mixer": {
          "forward_attributes": {
            "attributes": {
              "destination.service.uid": {
                "string_value": "istio://default/services/other-services"
              },
              "destination.service.host": {
                "string_value": "other-services.default.svc.cluster.local"
              },
              "destination.service.namespace": {
                "string_value": "default"
              },
              "destination.service.name": {
                "string_value": "other-services"
              }
            }
          },
          "mixer_attributes": {
            "attributes": {
              "destination.service.namespace": {
                "string_value": "default"
              },
              "destination.service.name": {
                "string_value": "other-services"
              },
              "destination.service.host": {
                "string_value": "other-services.default.svc.cluster.local"
              },
              "destination.service.uid": {
                "string_value": "istio://default/services/other-services"
              }
            }
          },
          "disable_check_calls": true
        }
      }
    }
  ]
}

If anything it just looks like the default policy should retry all 503s, but it isn't in this scenario.

@arianmotamedi
Contributor

Ah, it's retrying on connect-failure,refused-stream,unavailable,cancelled,resource-exhausted,retriable-status-codes. 503 is included in 5xx and gateway-error, neither of which is here.

As an alternative, you can try setting maxRequestsPerConnection to 1, which will completely disable keep-alives. It should help reduce the number of 503s, but depending on how much traffic you're getting, this could cause a noticeable performance drop (especially if calls to the Node server are over TLS) since every request will need to make a new connection to the upstream.
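A minimal sketch of that setting (the host is a placeholder):

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service
  namespace: default
spec:
  host: my-service.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        maxRequestsPerConnection: 1   # one request per upstream connection, i.e. no keep-alive reuse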

@jaygorrell
Contributor Author

I do understand why it's not captured by connect-failure now, but it should have been caught by retriable-status-codes since 503 is in the list (the only one in the list). See edit above with the second snippet to show what it looks like by default without an override, which is how things looked before my debugging today.

I'll revert it so the default policy kicks back in, but I already know it wasn't retrying things either, so it's still a little unclear why it's not handling the 503 in this specific case. I'll experiment with maxRequestsPerConnection as a temporary workaround if we can't get the retry to trigger.

@arianmotamedi
Contributor

I missed your second snippet. Hmm yeah if it's really not retrying with retriable-status-codes it might be an actual bug.

@jaygorrell
Contributor Author

jaygorrell commented Apr 11, 2019

Before removing my VS, I just tried again with 5xx added:

{
  "cluster": "outbound|80||my-service.default.svc.cluster.local",
  "timeout": "60s",
  "retry_policy": {
    "retry_on": "connect-failure,refused-stream,unavailable,cancelled,resource-exhausted,retriable-status-codes,5xx",
    "num_retries": 2,
    "retry_host_predicate": [
      {
        "name": "envoy.retry_host_predicates.previous_hosts"
      }
    ],
    "host_selection_retry_max_attempts": "3"
  },
  "max_grpc_timeout": "60s"
}

And waited for the issue to happen:

[2019-04-11 17:44:53.700][24][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:93] [C614] using existing connection
[2019-04-11 17:44:53.700][24][debug][router] [external/envoy/source/common/router/router.cc:1128] [C617][S2648660195423251461] pool ready
[2019-04-11 17:44:53.700][24][debug][connection] [external/envoy/source/common/network/connection_impl.cc:502] [C614] remote close
[2019-04-11 17:44:53.700][24][debug][connection] [external/envoy/source/common/network/connection_impl.cc:183] [C614] closing socket: 0
[2019-04-11 17:44:53.700][24][debug][client] [external/envoy/source/common/http/codec_client.cc:82] [C614] disconnect. resetting 1 pending requests
[2019-04-11 17:44:53.700][24][debug][client] [external/envoy/source/common/http/codec_client.cc:105] [C614] request reset
[2019-04-11 17:44:53.700][24][debug][router] [external/envoy/source/common/router/router.cc:538] [C617][S2648660195423251461] upstream reset
[2019-04-11 17:44:53.700][24][debug][filter] [src/envoy/http/mixer/filter.cc:133] Called Mixer::Filter : encodeHeaders 2
[2019-04-11 17:44:53.700][24][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1235] [C617][S2648660195423251461] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '57'
'content-type', 'text/plain'
'date', 'Thu, 11 Apr 2019 17:44:53 GMT'
'server', 'istio-envoy'

So at least it doesn't seem to be a problem with retriable-status-codes specifically, but maybe there's something about this scenario that means it isn't treated like just a 503 response? I'm just tossing out thoughts... I clearly have no idea here. :)

As a follow-up, I'm still not 100% clear on what creates the potential for this issue. Is it just the fact that Node has a relatively low keepalive timeout (5s), while others (like nginx) use much higher ones and close connections infrequently?

@jaygorrell
Contributor Author

jaygorrell commented Apr 11, 2019

@arianmotamedi So far so good on the single request-per-connection solution. The peak time has already passed for today but no errors in the past 2 hours. I'll close this tomorrow if it continues to hold up but I'm still curious if anyone has more information on the above question.

I'm mostly surprised this was isolated to a couple of my 100 services, all of which run on Node (5s keepalive timeout) (strike that part -- got it on a python/tornado service last night).

@arianmotamedi
Contributor

@jaygorrell is the issue still occurring even after setting maxRequestsPerConnection to 1?

@jaygorrell
Contributor Author

Nope, I applied that fix to the service with 95% of the errors yesterday and it has only happened in the few other services that experience the problem. A few outstanding pieces to me:

  1. I have ~100 services with very similar configurations (they use a standard boilerplate) but this only happens in a few of them -- the ones that serve long (30s+) requests. What could explain that?
  2. Why is this situation not caught by the default retry (both retriable-status-codes with 503 and 5xx)?

For 1, I realize that's more related to the Envoy issue so I can close this without knowing the answer there. For 2, I can open a separate more direct issue if that makes more sense.

@arianmotamedi
Contributor

For 1, I realize that's more related to the Envoy issue so I can close this without knowing the answer there.
For 2, I can open a separate more direct issue if that makes more sense.

I agree with both of these :)

@jaygorrell
Contributor Author

Continuing the remainder of this topic in envoyproxy/envoy#6190 for now then. Thanks for all the help @arianmotamedi -- it also looks like 1.1.3 is tagged so we should have a release with the timeout setting pretty soon!

@jaygorrell
Contributor Author

I just updated to 1.1.3 and removed the maxRequestsPerConnection: 1 setting with the intent of using the timeout setting ... but the issue has gone away for some reason. I'm going to keep things as-is and see how it goes, but it appears something may have changed in 1.1.3 to help this anyway.

@arianmotamedi
Contributor

It looks like others are seeing the same thing. Yeah let it run for a few days and see how it goes.

@jaygorrell
Contributor Author

I thought that was too good to be true. The message changed from "upstream connect error or disconnect/reset before headers" to "upstream connect error or disconnect/reset before headers. reset reason: connection termination", so my search had just stopped working.

So here's where I'm at. maxRequestsPerConnection: 1 took us from about ~120 errors/day down to ~10 errors/day. After going to 1.1.3 and dropping the maxRequestsPerConnection setting (confirmed in config_dump), I'm still getting just ~10/day. I assume all the pods/proxies don't need to be bounced anyway... am I mistaken there?

I'll try out the timeout setting to see if I can get to zero but I'm still a little confused at the improvement on 1.1.3.

@duderino
Contributor

@jaygorrell do you have a rough idea on the peak concurrent connections you had (connections going into envoy and out) or peak requests per second? I wonder if you're now occasionally hitting a circuit breaker.

On the improvement seen in 1.1.3, which release did you upgrade from?

@jaygorrell
Contributor Author

@duderino I don't have the environment in a similar state to confirm things, but I believe the leftover 503s after using the maxRequestsPerConnection and/or timeout tweaks are related to pods being torn down.

Whatever happens in that event isn't caught by the default Istio retry, either (envoyproxy/envoy#6726).

@jaygorrell
Contributor Author

This is fixed in envoyproxy/envoy#7505. I think this issue can be closed once those changes are pulled into Istio.

@mar1n3r0

mar1n3r0 commented Aug 6, 2019

This is fixed in envoyproxy/envoy#7505. I think this issue can be closed once those changes are pulled into Istio.

I am still experiencing the issue.
Are those changes synced now?

@vaibhavjain882

Thanks for the tips. I wouldn't say I can consistently repro it. It's common enough to annoy users but just infrequent enough to be difficult to catch. I'll give it a shot though.

I'm noticing now that no matter how hard I try when the cluster is virtually no traffic at night, it's very difficult to replicate. While during the day, I could replicate it by hitting a /ping endpoint that just responds with a 200 OK.

Something I'm just now realizing is that the services that have the problem are all services that can take quite long to respond. I'm wondering if after a request is kicked off that takes 40s to respond (and Istio times out at the 15s default), if there are any scenarios where subsequent requests would hit this issue? It was relatively common during business hours with the /ping endpoint, likely while other users were hitting the longer request endpoints. Just to be clear, these long-running requests are not a problem without going through Istio.

The 15s timeout on longer requests is a problem of it's own but that one is easy enough to solve.

I am facing a similar issue: a few of my microservices call internet-routable 3rd-party endpoints which generally respond in 20+ seconds. When I removed Istio from my cluster, all the requests returned a 200 status code, but with Istio I am getting 504 Gateway Timeouts and my requests are closed at 15 seconds in every case.

I am wondering if there is any way to increase this Envoy timeout value?

Logs before and after istio:

without istio:

2019-09-06 19:42:20.113  INFO [xx-xx-adapter,9b32565791541300,9b32565791541300,false] 1 --- [or-http-epoll-4] c.s.t.s.impl.common.PrepareRequest       : Start Preparing search request
2019-09-06 19:42:20.117  INFO [xx-xx-adapter,9b32565791541300,9b32565791541300,false] 1 --- [or-http-epoll-4] c.s.t.s.impl.common.PrepareRequest       : Done Preparing search request
2019-09-06 19:42:42.162  INFO [xx-xx-adapter,9b32565791541300,9b32565791541300,false] 1 --- [or-http-epoll-8] c.s.t.service.impl.TVSearchServiceImpl   : xxxx search response took 21.957 Seconds
2019-09-06 19:42:42.292  INFO [xx-xx-adapter,9b32565791541300,9b32565791541300,false] 1 --- [or-http-epoll-8] c.s.t.service.impl.common.Transformer    : Doing transformation of supplier response into our response
2019-09-06 19:42:42.296  INFO [xx-xx-adapter,9b32565791541300,9b32565791541300,false] 1 --- [or-http-epoll-8] c.s.t.service.impl.common.Transformer    : Transformer: Parsing completed in 3 mSeconds

After Istio:

2019-09-06 19:40:00.048  INFO [xxx-xx-adapter,32c55821a507d6f3,32c55821a507d6f3,false] 1 --- [or-http-epoll-3] c.s.t.s.impl.common.PrepareRequest       : Start Preparing search request
2019-09-06 19:40:00.048  INFO [xxx-xx-adapter,32c55821a507d6f3,32c55821a507d6f3,false] 1 --- [or-http-epoll-3] c.s.t.s.impl.common.PrepareRequest       : Done Preparing search request

2019-09-06 19:40:15.058  INFO [xx-xx-adapter,32c55821a507d6f3,32c55821a507d6f3,false] 1 --- [or-http-epoll-7] c.s.t.service.impl.xxxx   : xxx Search Request {"rqst":{"Request":{"__type":"xx","CheckIn":"/Date(1569628800000+0000)/","CheckOut":"/Date(1569801600000+0000)/","DetailLevel":9,"ExcludeHotelDetails":false,"GeoLocationInfo":{"Latitude":25.204849,"Longitude":55.270782},"Nights":0,"RadiusInMeters":25000,"Rooms":[{"AdultsCount":2,"KidsAges":[2]}],"DesiredResultCurrency":"EUR","Residency":"GB","TimeoutSeconds":25,"ClientIP":"127.0.0.1"},"RequestType":1,"TypeOfService":2,"Credentials":{"UserName":"xxxx","Password":"xx"}}}
2019-09-06 19:40:15.058 ERROR [xx-xx-adapter,32c55821a507d6f3,32c55821a507d6f3,false] 1 --- [or-http-epoll-7] c.s.t.service.impl.xxxx   : xxx Search request failed 504 GATEWAY_TIMEOUT

2019-09-06 19:40:15.058  INFO [hotel-tv-adapter,32c55821a507d6f3,32c55821a507d6f3,false] 1 --- [or-http-epoll-7] c.s.t.service.impl.xxxx   : xx search response took 15.009 Seconds

2019-09-06 19:40:15.059 ERROR [hotel-tv-adapter,32c55821a507d6f3,32c55821a507d6f3,false] 1 --- [or-http-epoll-7] a.w.r.e.AbstractErrorWebExceptionHandler : [79d38e2f] 500 Server Error for HTTP POST "/search/geo-location"

java.lang.RuntimeException: Error occurred, We did not receive proper search response from xx please check internal logs for more info

<Java Stack trace >

2019-09-06 19:40:15.061 ERROR [hotel-tv-adapter,32c55821a507d6f3,32c55821a507d6f3,false] 1 --- [or-http-epoll-7] c.s.t.service.impl.xxxx   : xxx search response upstream request timeout
2019-09-06 19:41:16.081  INFO [hotel-tv-adapter,,,] 1 --- [      Thread-22] o.s.s.concurrent.ThreadPoolTaskExecutor  : Shutting down ExecutorService 'executor'

Any clue how to fix this? Istio-1.2.2 and 1.2.5

@amir-hadi

I think we are experiencing this with istio 1.4.4. If we don't set maxRequestsPerConnection to 1, we are getting 503s all the time. This feature still makes sense for us, so I'd like to keep it on, but with that many errors, our hands are tied.

@mikosa

mikosa commented Jun 5, 2020

Make sure the HPA for Istio Pilot allows more than 5 replicas.
I had the same error and noticed Pilot was scaled to 5, so I increased autoscaling to 20.
Now Pilot has stabilized at 7.
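For example (assuming the default istio-pilot HPA name from the Helm chart):

# Inspect Pilot's autoscaler and raise its ceiling if it is pegged at maxReplicas
kubectl -n istio-system get hpa istio-pilot
kubectl -n istio-system patch hpa istio-pilot --patch '{"spec": {"maxReplicas": 20}}'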

@aaronbbrown

Apologies for commenting on this long-closed issue, but I was also experiencing this problem. Because some of our traffic comes in via NodePorts directly to the service, there's no outbound rule that I can set for retries, so I ended up adding this EnvoyFilter, which retries upstream reset connections rather than expecting the client to do so:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: passthrough-retries
  namespace: myapp
spec:
  workloadSelector:
    labels:
      app: myapp
  configPatches:
  - applyTo: HTTP_ROUTE
    match:
      context: SIDECAR_INBOUND
      listener:
        portNumber: 8080
        filterChain:
          filter:
            name: "envoy.filters.network.http_connection_manager"
            subFilter:
              name: "envoy.filters.http.router"
    patch:
      operation: MERGE
      value:
        route:
          retry_policy:
            retry_back_off:
              base_interval: 10ms
            retry_on: reset
            num_retries: 2

After applying ☝🏻 to my service, the random 503/UC errors seem to have gone away regardless of the client or ingress method.

Hopefully this is helpful to someone!
