503 "upstream connect error or disconnect/reset before headers" in 1.1 with low traffic #13205
Comments
Unfortunately this is not true - 503s cannot be automatically retried due to the risky nature of non-idempotent requests. You can define a virtual service and use route-level retries if you're confident that multiple identical requests made to the server won't have any adverse effects, but this is not something Envoy would have knowledge of on its own.

The ability to define idle timeout values for upstream connections is currently waiting to be merged in 1.1: #13146. Once it makes it in, you can play around with it to see if it helps with the 503s. Also see #9113 for a similar discussion around this issue. |
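For reference, a route-level retry of that sort is declared on a VirtualService; here's a minimal sketch, with placeholder names (my-service is hypothetical, not taken from this thread):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service-retries   # hypothetical name
spec:
  hosts:
  - my-service.default.svc.cluster.local
  http:
  - route:
    - destination:
        host: my-service.default.svc.cluster.local
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: gateway-error,connect-failure,refused-stream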
Thanks @arianmotamedi. I've been all over those issues and that's how I was able to get as far as I am now.

The default retry was added for: Does that not capture the "upstream connect error or disconnect/reset before headers" issue? I'd be happy to add a custom one, but I thought

Also, where exactly is the idle timeout happening? I still have an understanding gap there. Basically, what is the default and which side is timing out early? I'm guessing the node services are closing connections, but I'm not sure how long Envoy tries to reuse them anyway. |
Did you add this yourself? For these to work you'll need to pass the

In most cases, these 503s seem to happen when an upstream service closes a connection but Envoy still thinks the connection is reusable and puts it back in the connection pool. Idle timeout was implemented to address this issue: you set it in your cluster definition (it'll be a

To see what's causing the 503s in your specific case you'll have to run a tcpdump and look at the captured TCP traffic, but based on the symptoms you described (it occurring after being idle for some time), it's likely a keepalive issue and not a connect-failure case. |
Hmm, I'm referring to #10566, which was in 1.1.0. Are you saying I'm still missing something required for that to kick in? My understanding is that it applies to all requests by default and handles 503s caused by |
Oh interesting, I was not aware of changes that made it into 1.1. In any case, there's no indication here that the 503s are happening because of a connect failure vs upstream reset. You won't know that for sure unless you do a packet capture. Also check sidecar proxy's config and see what this retry policy is actually being translated to. |
Considering you're able to consistently repro your 503, I would personally turn on sidecar debug logs on both your source and destination by port-forwarding 15000 and doing

The logs are super verbose but should give some insight. |
Thanks for the tips. I wouldn't say I can consistently repro it. It's common enough to annoy users but just infrequent enough to be difficult to catch. I'll give it a shot though.

I'm noticing now that no matter how hard I try, when the cluster has virtually no traffic at night it's very difficult to replicate, while during the day I could replicate it by hitting a

Something I'm just now realizing is that the services that have the problem are all services that can take quite long to respond. I'm wondering, after a request is kicked off that takes 40s to respond (and Istio times out at the 15s default), are there any scenarios where subsequent requests would hit this issue? It was relatively common during business hours with the

The 15s timeout on longer requests is a problem of its own but that one is easy enough to solve. |
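For what it's worth, raising that per-route timeout is typically done on the VirtualService route; a minimal sketch with placeholder names (slow-service is hypothetical):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: slow-service-timeout   # hypothetical name
spec:
  hosts:
  - slow-service.default.svc.cluster.local
  http:
  - route:
    - destination:
        host: slow-service.default.svc.cluster.local
    timeout: 60s   # raise the 15s default route timeout for this slow service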
A timeout would give you a 504, not 503, so I highly doubt that it's related. But as Stono suggested, you can turn on debug logs to see if anything stands out. |
The 504 would be for the request that actually timed out. I was referring to subsequent requests. I'll be trying to capture logs from the issue today with traffic ramping back up. |
Alright, I was able to capture some 503s with verbose logging on (Thanks @Stono!). I have no idea which pieces are really helpful though so here's a relevant section if it helps.
There is much more but it's very verbose -- the big list of attributes and I assume this bit is the key part:
Does "resetting 1 pending requests" mean the retry did attempt to go through? |
There it is:
It's using an existing connection from the pool, so not a connect issue. This looks exactly like the case where upstream has closed a connection but Envoy is still keeping it around in the pool. Highly likely it's related to envoyproxy/envoy#6190, which is a known bug. You should be able to use idle_timeout to alleviate most of these 503s. It was merged into the 1.1 branch, so should be available on the next release (1.1.3?)
No, that's just what happens when there is a local or remote close event: https://github.com/envoyproxy/envoy/blob/cbf03b90e8aa1c3476b493d8629abe7fc82b22c9/source/common/http/codec_client.cc#L79. I don't see anything in the logs that the request was retried - can you dump sidecar proxy's config? |
I see, thank you! I guess my next question is if there is any mitigation I can do on the server side? Does envoy have a default idle timeout for the pool that I would need the server to exceed? It's node, which defaults to 5s. Also I'm still unclear why the default retry in 1.1 didn't mitigate the issue. (edit: just saw your edit - ignore this... getting the dump) |
No, and that's the issue - there is no idle timeout by default. Changes in #13146 will allow you to specify one though, so as soon as it's released you should be able to either set this to < 5s for services that call node, or, if this is a production environment and performance matters, set it to something like 15s and have your node service keep the connection open for 30s.
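A rough sketch of what that could look like once the setting is available, assuming it ends up exposed on the DestinationRule connection pool as connectionPool.http.idleTimeout (the shape it takes in later releases); names are placeholders:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: node-service-idle-timeout   # hypothetical name
spec:
  host: node-service.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        idleTimeout: 4s   # keep Envoy's pool idle timeout below node's 5s keep-alive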
This is why I asked if you could get a config dump of the sidecar to see if Pilot is passing the correct route retry configuration to Envoy. |
@arianmotamedi Is there a specific section I should look for? It's 35k lines but I'll dump it all if that's useful. |
I think I found it:
Though this is an override specifically for the service having the issue, just in case the default policy wasn't working.

edit: Here's the retry policy applied to all other services that I have not added a VirtualService for -- this is what would have applied to this one yesterday before I started trying to debug.
If anything it just looks like the default policy should retry all 503s, but it isn't in this scenario. |
Ah, it's retrying on

As an alternative, you can try setting maxRequestsPerConnection to 1, which will completely disable keep-alives. It should help reduce the number of 503s, but depending on how much traffic you're getting, this could cause a noticeable performance drop (especially if calls to the node server are over TLS) since every request will need to make a new connection to upstream. |
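For reference, that knob also lives on the DestinationRule connection pool; a minimal sketch with placeholder names:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: node-service-no-keepalive   # hypothetical name
spec:
  host: node-service.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        maxRequestsPerConnection: 1   # one request per upstream connection, i.e. no reuse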
I do understand why it's not captured by

I'll revert it so the default policy kicks back in, but I already know it wasn't retrying things either, so it's still a little unclear why it's not handling the 503 in this specific case. I'll experiment with |
I missed your second snippet. Hmm yeah if it's really not retrying with |
Before removing my VS, I just tried again with
And waited for the issue to happen:
So at least it doesn't seem to be a problem with

As a follow-up, I'm still not 100% clear on what causes the potential for this issue. Is it just the fact that node has a relatively low keepalive timeout (5s) while others (like nginx) are much higher and close connections infrequently? |
@arianmotamedi So far so good on the single request-per-connection solution. The peak time has already passed for today but no errors in the past 2 hours. I'll close this tomorrow if it continues to hold up but I'm still curious if anyone has more information on the above question. I'm mostly surprised this was isolated to a couple of my 100 services, |
@jaygorrell is the issue still occurring even after setting maxRequestsPerConnection to 1? |
Nope, I applied that fix to the service with 95% of the errors yesterday and it has only happened in the few other services that experience the problem. A few outstanding pieces to me:
For 1, I realize that's more related to the Envoy issue so I can close this without knowing the answer there. For 2, I can open a separate more direct issue if that makes more sense. |
I agree with both of these :) |
Continuing the remainder of this topic in envoyproxy/envoy#6190 for now then. Thanks for all the help @arianmotamedi -- it also looks like 1.1.3 is tagged so we should have a release with the timeout setting pretty soon! |
I just updated to 1.1.3 and removed the |
It looks like others are seeing the same thing. Yeah let it run for a few days and see how it goes. |
I thought that was too good to be true. The message changed from

So here's where I'm at. I'll try out the timeout setting to see if I can get to zero, but I'm still a little confused by the improvement on 1.1.3. |
@jaygorrell do you have a rough idea on the peak concurrent connections you had (connections going into envoy and out) or peak requests per second? I wonder if you're now occasionally hitting a circuit breaker. On the improvement seen in 1.1.3, which release did you upgrade from? |
@duderino I don't have the environment in a similar state to confirm things, but I believe the leftover 503s after using the

Whatever happens in that event isn't caught by the default Istio retry, either (envoyproxy/envoy#6726). |
This is fixed in envoyproxy/envoy#7505. I think this issue can be closed once those changes are pulled into Istio. |
I am still experiencing the issue. |
I am facing a similar issue: a few of my microservices call internet-routable 3rd-party endpoints that generally respond in 20+ seconds. When I removed Istio from my cluster, all the requests returned a 200 status code, but with Istio I am getting 504 Gateway Timeouts and my requests are closed in 15 seconds in every case. I am wondering if there is any way to increase this Envoy timeout value?

Logs before and after Istio:

Without Istio:
After Istio:
Any clue how to fix this? Istio-1.2.2 and 1.2.5 |
I think we are experiencing this with istio 1.4.4. If we don't set |
Make sure hpa for istio pilot is > 5 |
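In a Helm-based install like the one described in this issue, that suggestion would presumably map to the pilot autoscaling values; a sketch of a values override, with field names assumed from the Istio Helm chart:

pilot:
  autoscaleMin: 6    # keep the minimum pilot replica count above 5, per the suggestion above
  autoscaleMax: 10   # assumed ceiling; adjust to the cluster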
Apologies for commenting on this long-closed issue, but I was also experiencing this problem. Because some of our traffic comes in via NodePorts directly to the service, there's no outbound rule that I can set for retries, so I ended up adding this EnvoyFilter, which retries upstream resets:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: passthrough-retries
  namespace: myapp
spec:
  workloadSelector:
    labels:
      app: myapp
  configPatches:
  - applyTo: HTTP_ROUTE
    match:
      context: SIDECAR_INBOUND
      listener:
        portNumber: 8080
        filterChain:
          filter:
            name: "envoy.filters.network.http_connection_manager"
            subFilter:
              name: "envoy.filters.http.router"
    patch:
      operation: MERGE
      value:
        route:
          retry_policy:
            retry_back_off:
              base_interval: 10ms
            retry_on: reset
            num_retries: 2

After applying ☝🏻 to my service, the random 503/UC errors seem to have gone away regardless of the client or ingress method. Hopefully this is helpful to someone! |
Describe the bug
Many services in our environment are experiencing a handful (~1% or so) of 503 errors with the response "upstream connect error or disconnect/reset before headers". I have rolled istio out to two environments and they both experience similar issues.
These environments do not have any Istio gateways -- it's just service to service communication; however, I did replicate the issue using curl through k8s ingress (outside mesh) to a mesh service so I assume it's isolated to the destination's sidecar, which returns the 503.
Digging through some older issues, it appears this is related to a keepalive timeout, which does fit my experience. It's usually my first curl after a long break that fails, if that makes sense with anything. In 1.1, there was a default retry added that I believe was supposed to handle this case but it doesn't appear to be kicking in for me.
Is there anything that needs to be enabled for the default retry to take effect? Policy checks, for example. I also don't have Virtual Services or Destination Rules added yet, but I was told those aren't needed for it either. I did do a quick test by adding one and it didn't help.
Here's a curl example (through k8s ingress) that I tried this morning after being idle all night:
Expected behavior
503s with "upstream connect error or disconnect/reset before headers" are retried by default.
Steps to reproduce the bug
Not completely sure, but some key information:
Version
Istio 1.1.2
K8S 1.11.8
Installation
Helm template / kubectl apply; nothing of note in the helm value overrides -- just setting a minimum of 2 for istio pods, rewriting http probes, enabling mtls (permissive), and enabling grafana/kiali.
Environment
AWS / Kops