0DPE (Downstream Protocol Error) caused by HTTP2 Filter #25218
Comments
I had a thought that maybe the DPEs (whatever caused those) in turn caused outlier detection to kick in. Unfortunately we weren't scraping Envoy metrics into Prometheus so I don't have historical data, but one of the pods had:
For context there are 6. However, that doesn't explain why we were getting request failures, as we have a max ejection % of 50. It's like all hosts were evicted and the requests just stalled, resulting in a 503. I've raised #25219, which would be nice: the ability to enable outlier logs on a pod level rather than globally.
Also it looks as if outlierDetection is not working as documented; we're getting evictions on 5xx codes despite explicit configuration to exclude them: #25220
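For reference, the sort of outlier detection configuration being described (5xx excluded from ejection, 50% max ejection) would look roughly like the sketch below. The host, name, and values are placeholders, not the actual config from this cluster; field names are taken from the Istio DestinationRule reference.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: consumer-gateway            # placeholder name
spec:
  host: consumer-gateway.default.svc.cluster.local   # placeholder host
  trafficPolicy:
    outlierDetection:
      consecutiveGatewayErrors: 5   # eject on gateway errors (502/503/504) only
      consecutive5xxErrors: 0       # 0 disables ejection on general 5xx responses from the app
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50        # matches the 50% ceiling mentioned above
```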
@mandarjog could this be some sort of unexpected interaction with http2 + outlier detection? @howardjohn any ideas? It was quite a nasty incident for us :(
For posterity, 1.5.7 was a security release and nothing more. The only commits were pertaining to the CVEs: https://istio.io/latest/news/releases/1.5.x/announcing-1.5.7/. As you can see from the commit log https://github.com/istio/istio/commits/1.5.7, the only change since 1.5.6 (June 15th) has been to update the proxy SHA. If this issue is continuing to happen, can you try the 1.5.6 proxy image? (I know it has some of the CVEs mentioned above, so try it in a safer setup.)
We've only had one occurrence of this so far (thankfully), which unfortunately means that rolling back won't give a timely, deterministic data point. If we see it pop again, that will be what we try next.
@lambdai I have never seen a DPE before... any ideas what can cause this?
WRT
Explained in #25220 that the outlier detector sees a wider occurrence of 5xx. The metric shows the outlier detector is triggered, likely by gateway errors.
"DPE" is new to me, too. I need to dig |
Hey, I've provided you my config dumps and metrics. @lambdai can you confirm whether I'm perhaps being hit by that bug, and if so what the impact is? It isn't clear to me. This is my stats config:
GHSA-8hf8-8gvw-ggvx is one of the vulnerability fixes in 1.5.7:
istio/envoy@ea2d62e#diff-0cde9df72a69112b1d9af9cc414a107bR315
Is there any chance this counter is captured by Prometheus?
Yes, that metric is there but it's 0:
It's worth noting, by the way, that we only really see this under heavy load. Right now (about 50% load) we're hardly getting any active outlier ejections. Unfortunately I was not able to enable the outlierLog because of another bug: #25269. I'm going to try rolling back to 1.5.6 to see what happens. @lambdai I still don't understand what that stats bug you fixed does, or whether I'm impacted by it.
Update: same behaviour in 1.5.6, so that removes one variable.
@lambdai you never answered my question about the impact of the upstream bug you fixed and whether it impacts us.
@lambdai ping... ^^
Sorry I lost this thread. |
@lambdai I'm really sorry but I don't understand the bug. We have that metric enabled, so are we OK? The Envoy PR says "Risk Level: Mid. User would see documented behavior when the ejection metric is disabled." That would seem to imply the inverse of what you just said (i.e. we need to turn the metric off in order to get the documented behaviour).
The documented (and expected) behavior is that ejection is set up from the outlier detection config, regardless of whether you enable the metric. The buggy behavior was that you needed both the metric and the outlier detection config to enable the percentage accounting. So an alternative workaround is to enable the metric.
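For what it's worth, on 1.5-era Istio the usual way to make the sidecar instantiate and expose the outlier detection stats is the stats inclusion annotation on the workload. A sketch, with a placeholder deployment and an illustrative regex (not the config from this thread):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: consumer-gateway                  # placeholder
spec:
  selector:
    matchLabels:
      app: consumer-gateway
  template:
    metadata:
      labels:
        app: consumer-gateway
      annotations:
        # ask the sidecar to keep/expose outlier detection stats in addition to the defaults
        sidecar.istio.io/statsInclusionRegexps: "cluster\\..*outlier_detection.*"
    spec:
      containers:
      - name: app                          # application container, details omitted
        image: example/app:latest
```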
@Stono My understanding is that you enabled the metrics there but still too many endpoints are injected? That's probably another bug, but I wish you could confirm.
When you say "injected" do you mean "ejected"? If so, yes, we were getting too many ejections but that has been fixed in https://t.co/YaN2TLfB9q; we are just waiting for the release. We still haven't gotten to the bottom of this bug and the DPE errors, however.
@howardjohn @mandarjog we've discovered this is because of http2! What you see is:
The last time we had this problem, restarting the workload seemed to sort it. However this time we restarted both workloads, due to it saying ... It's worth noting we use http2 for other high-throughput services; however we have only ever seen this issue between these two workloads. The only thing in the nginx proxy logs was:
And the only thing in the consumer-gateway logs was:
We're on 1.5, so http2 is enabled via an EnvoyFilter:
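The filter itself wasn't captured above; for illustration only, an EnvoyFilter of roughly this shape is the common way to force HTTP/2 on the upstream (sidecar-to-sidecar) connections in that era of Istio. The name, namespace, and match scope here are assumptions:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: enable-h2-upstream        # placeholder name
  namespace: istio-system         # root namespace => applies mesh-wide
spec:
  configPatches:
  - applyTo: CLUSTER
    match:
      context: SIDECAR_OUTBOUND
    patch:
      operation: MERGE
      value:
        http2_protocol_options: {}   # make Envoy speak h2 to the upstream endpoints
```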
We see a few, yes @mandarjog:
http2/codec_impl.cc:681 - invalid frame: Invalid HTTP header field was received on stream 8795 suggests that an http1 header is being encoded incorrectly by the client-side Envoy codec.
To sort it out: ingress-nginx app -> ingress-nginx-sidecar: unknown HTTP protocol, nginx decides.
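For context (not stated in this thread): which protocol the sidecar assumes on a given hop is normally driven by the destination Service's port name prefix, falling back to protocol sniffing for unnamed or tcp ports. A hypothetical example:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: consumer-gateway           # placeholder
spec:
  selector:
    app: consumer-gateway
  ports:
  - name: http                     # "http" (or "http-<suffix>") => HTTP/1.1; "http2"/"grpc" => h2; "tcp" => opaque TCP
    port: 80
    targetPort: 8080
```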
@PiotrSikora so now we know the response headers are the problem, and not the request headers from the gateway Envoy to the consumer Envoy?
You're right, it even says Downstream Protocol Error... That's what I get for responding before I'm fully awake. In any case, running a debug build would tell us exactly what's triggering this error... Also, it's possible that ...
@Stono You are right. I cannot reproduce it. It's great you can confirm there is ...
@PiotrSikora the only way to get DPE at the ...
No... @lambdai that doesn't make any sense. The sidecar is killing the connection way before the consumer gateway sends a response, so it can't be anything to do with the response headers. The broken pipe in the app is a symptom of the sidecar dropping the connection, which in turn is a symptom of the DPE.
No, unfortunately I cannot run a debug build for this. It's manifested twice in production in three weeks on a service doing 1000+ ops/second. It's simply not feasible to leave debug logs running in production for that period of time. Happy to run a build with more logging specifically around the DPE code, however.
Just want to clarify that sometimes ...
I agree it's Envoy that closes the connection. What I'm not fully sure about is the latter part: I suspect Envoy received the headers but thinks the headers are corrupted. I saw this in Envoy before but I don't know if your case is the same.
No, I promise you 100% consumer-gateway has not written any headers. The DPE happens way, way before the app attempts to write the response; that is clearly visible in the waterfall trace I shared above.
Also, even if there were invalid headers, that doesn't explain why we only have the problem with http2 enabled between sidecars. In both cases, http1 and http2, the connection between the consumer-gateway sidecar and the app is http1.1.
I trust you, but I am wondering how the gateway app could attempt to write anything to the gateway sidecar. It's http1, a strict request-response model.
@lambdai perhaps we're getting lost in translation here... I'm saying the consumer-gateway app is writing the response back, but that fails with a broken pipe exception as it's trying to write a response on a connection that has been closed by the source. This is how Java manifests a downstream (in this case the consumer-gateway sidecar) disconnection. Take this trace, where consumer-gateway cannot return anything until sauron-web has responded (as consumer-gateway is a simple Java Zuul proxy). (Note: there's actually a lot more under sauron-web, I've just trimmed it in the screenshot.)

This is what I believe is happening:

The issue can't be in the request headers, as the request makes it to consumer-gateway (it starts making its downstream requests). It's something happening between the request getting to the app and the app returning response data.
@howardjohn @mandarjog @lambdai we recently tried turning on http2 again on 1.7 (this issue was raised on 1.5) and can confirm that after a few days (3) we started to see the same failures, and backed it out. I don't really have any more information other than what has been shared on this thread. The extremely difficult thing here is that it takes days to start manifesting, and when it does it's only on a service that receives very high throughput, so capturing istio-proxy logs can be difficult.
In fact, one other piece of information I was able to glean from our edge logs was that the issues affected all the different services routing through consumer-gateway. So as a refresher, ...
As part of #31136 I've turned http2 on for one of our services again. This time we're on 1.8.3.
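On newer Istio versions the upgrade to HTTP/2 between sidecars can also be expressed via the DestinationRule connection pool settings rather than an EnvoyFilter. A sketch with placeholder names (not necessarily how it was enabled here):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: consumer-gateway-h2        # placeholder
spec:
  host: consumer-gateway.default.svc.cluster.local   # placeholder host
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE   # upgrade HTTP/1.1 connections to this host to HTTP/2
```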
@howardjohn as you saw in Slack we decided to make http2 the default again. You can see from the trace that it's effectively the first "hop" that fails. So whatever this issue is, it's still present.
Interestingly, it's the same service as last year (...).
What are the response flag details?
What do you mean sorry? Is there some other info I can get for you?
Yes that's the correct flow
Yeah, but why would that happen? And also, why would it just degrade without any releases or code changes, and be immediately fixed by toggling http2 on and off? This happened at 6am one morning; http2 had been enabled on ...
https://istio.io/latest/docs/tasks/observability/logs/access-log/#default-access-log-format Look for %RESPONSE_CODE_DETAILS%. It is possible that this is data dependent, so some specific headers may bring out this behaviour in h2 processing. @lambdai is there a way to dump full request contents when a DPE is detected? We may need a special istio-proxy build.
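If it helps, a minimal way to surface %RESPONSE_CODE_DETAILS% (and the response flags) mesh-wide is via the access log settings in mesh config. The format string below is a trimmed illustration, not a recommended production format:

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    accessLogFile: /dev/stdout
    # trimmed format: response code, flags, and the detail string relevant to DPE debugging
    accessLogFormat: |
      [%START_TIME%] "%REQ(:METHOD)% %REQ(:PATH)% %PROTOCOL%" %RESPONSE_CODE% %RESPONSE_FLAGS% %RESPONSE_CODE_DETAILS% %UPSTREAM_HOST%
```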
So we don't have the access log enabled by default, as it's noisy and gives us little over what Jaeger does.
We have the exact same issue. Any way to log ingress request headers? This is the error in the logs: |
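One possible way to log inbound request headers on a specific workload (illustrative only; the selector, names, and filter placement are assumptions, not something suggested elsewhere in this thread) is a Lua EnvoyFilter:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: log-request-headers        # placeholder
spec:
  workloadSelector:
    labels:
      app: consumer-gateway        # placeholder selector
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
            subFilter:
              name: envoy.filters.http.router
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.lua
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
          inline_code: |
            function envoy_on_request(request_handle)
              -- dump every request header to the proxy log at info level
              for key, value in pairs(request_handle:headers()) do
                request_handle:logInfo(key .. ": " .. value)
              end
            end
```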
Bug description
I don't even know where to start with this one... I have no idea what went wrong, but I'm going to share what info I do have here in case it gets anyone's attention, because it was quite a significant failure which in the end was recovered by restarting the source workload (nginx).

I know correlation != causation, but we've been stable on 1.5.0-1.5.6 for some time, without issue, and upgraded to 1.5.7 four days ago.
Symptoms:
We started to see waves of requests fail between two specific workloads on the cluster (all other workloads were fine), the source workload being nginx and the destination workload being consumer-gateway.

The errors came in waves of decreasing duration; you can see from the Jaeger scatter plot how you get a spike, then the requests gradually fail faster before eventually recovering.
There was no correlation on the source workload:

The errors primarily manifested in two ways:
1. 503UC from the source sidecar
This represented the majority of errors recorded: a 503 was recorded from the source (nginx) sidecar. There was no span recorded from the destination (consumer-gateway), leading me to believe the request was never even sent.

2. 0DPE
There were also a fair few of these "Downstream Protocol Errors" recorded by the destination (consumer-gateway) sidecar. These are interesting because they show the request reached consumer-gateway.
So far I've checked the following:
Both workloads have been running for 4 days, since we upgraded to 1.5.7 (which is actually the only thing that has changed recently).

It's worth noting we use both mTLS and http2 (via EnvoyFilter):

However, neither of these things has changed in well over 6 months.
[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[x] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
Expected behavior
Not to break :-)
Steps to reproduce the bug
I've not managed to reproduce this
Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)

1.5.7
How was Istio installed?
Helm
Environment where bug was observed (cloud vendor, OS, etc)
GKE 1.15