
0DPE (Downstream Protocol Error) caused by HTTP2 Filter #25218


Open
Stono opened this issue Jul 5, 2020 · 53 comments
Labels
area/networking lifecycle/staleproof Indicates a PR or issue has been deemed to be immune from becoming stale and/or automatically closed

Comments

@Stono
Contributor

Stono commented Jul 5, 2020

Bug description
I don't even know where to start with this one... I have no idea what went wrong, but I'm going to share what info I do have here in case it gets anyone's attention, because it was quite a significant failure which in the end was recovered by restarting the source workload (nginx).

I know correlation != causation, but we've been stable on 1.5.0-1.5.6 for some time, without issue, and upgraded to 1.5.7 4 days ago.

Symptoms:
We started to see waves of requests failing between two specific workloads on the cluster (all other workloads were fine). The source workload was nginx and the destination workload was consumer-gateway.

Screenshot 2020-07-05 at 19 16 21

The errors came in waves of decreasing duration; you can see from the Jaeger scatter plot here how you get a spike, the requests gradually fail faster, and then things eventually recover.

[screenshot: Jaeger scatter plot]

There was no correlation on the source workload:
Screenshot 2020-07-05 at 19 35 31

The errors primarily manifested in two ways:

1. 503UC from the source sidecar
This represented the majority of errors recorded: a 503 was recorded by the source (nginx) sidecar. There was no span recorded from the destination (consumer-gateway), leading me to believe the request was never even sent.

Screenshot 2020-07-05 at 18 47 57

2. 0DPE
There were also a fair few of these "Downstream Protocol Errors" recorded by the destination (consumer-gateway) sidecar. These are interesting because they show the request reached consumer-gateway.

Screenshot 2020-07-05 at 19 16 57

So far I've checked the following:

  • There was nothing in pilot logs, nothing in the istio-proxy logs (although we run those in warn).
  • There was no correlation from an infrastructure perspective (we observed failures across several nodes, across all zones). The only correlation was the fact the issue was between just these two workloads.
  • We have rafts of network testing which indicated no issue.
  • I wondered if this could be something odd to do with mTLS cert rotation, but it doesn't look like the certificates on either side have changed recently:
❯ k -n consumer-gateway get secret istio.default -o json | jq -r '.data["cert-chain.pem"]' | base64 --decode | openssl x509 -noout -text | grep Before
            Not Before: Jun 18 17:57:06 2020 GMT

❯ k -n ingress-nginx get secret istio.default -o json | jq -r '.data["cert-chain.pem"]' | base64 --decode | openssl x509 -noout -text | grep Before
            Not Before: Jun 18 17:53:33 2020 GMT

Both workloads have been running for 4 days, since we upgraded to 1.5.7 (which is actually the only thing which has changed recently).

It's worth noting we use both mTLS and http2 (via an EnvoyFilter):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: consumer-gateway
  namespace: consumer-gateway
spec:
  host: app
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 2147483646
        http2MaxRequests: 2147483646
        idleTimeout: 30m
        maxRequestsPerConnection: 2147483646
        maxRetries: 2147483646
      tcp:
        maxConnections: 2147483646
        tcpKeepalive:
          interval: 30s
          time: 30s
    loadBalancer:
      simple: RANDOM
    outlierDetection:
      baseEjectionTime: 2m
      consecutiveErrors: 2
      interval: 5s
      maxEjectionPercent: 50
      minHealthPercent: 50
    tls:
      mode: ISTIO_MUTUAL
---
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: h2-consumer-gateway
  namespace: istio-system
spec:
  configPatches:
  - applyTo: CLUSTER
    match:
      cluster:
        service: app.consumer-gateway.svc.cluster.local
      context: SIDECAR_OUTBOUND
    patch:
      operation: MERGE
      value:
        http2_protocol_options:
          max_concurrent_streams: 47707

However, neither of these things has changed in well over 6 months.

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[x] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior
Not to break :-)

Steps to reproduce the bug
I've not managed to reproduce this

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)
1.5.7

How was Istio installed?
Helm

Environment where bug was observed (cloud vendor, OS, etc)
GKE 1.15

@Stono
Contributor Author

Stono commented Jul 5, 2020

I had a thought that maybe the DPEs (whatever caused those) in turn caused outlier detection to kick in. Unfortunately we weren't scraping Envoy metrics into Prometheus so I don't have historical data, but one of the pods had:

❯ cat ~/Desktop/metrics|grep outlier_dete | grep consumer
envoy_cluster_outlier_detection_ejections_enforced_total{cluster_name="outbound|80||app.consumer-gateway.svc.cluster.local"} 112
envoy_cluster_outlier_detection_ejections_active{cluster_name="outbound|80||app.consumer-gateway.svc.cluster.local"} 3

For context there are 6 nginx and 6 consumer-gateway pods.

However, that doesn't explain why we were getting request failures, as we have a max ejection % of 50. It's as if all hosts were ejected and the requests just stalled, resulting in a 503.

I've raised #25219, which would be nice: the ability to enable outlier logs at a pod level rather than globally.
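
For reference, a minimal way to pull these stats live off a sidecar rather than from a saved metrics dump (pod name is a placeholder; assumes pilot-agent's admin passthrough is available in this Istio version):

❯ kubectl -n ingress-nginx exec <nginx-pod> -c istio-proxy -- pilot-agent request GET stats | grep consumer-gateway | grep outlier_detection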

@Stono
Contributor Author

Stono commented Jul 5, 2020

Also, it looks as if outlierDetection is not working as documented; we're getting ejections on 5xx codes despite explicit configuration to exclude them: #25220

@Stono
Contributor Author

Stono commented Jul 5, 2020

@mandarjog could this be some sort of unexpected interaction with http2+outlier?

@howardjohn any ideas? It was quite a nasty incident for us :(

@rshriram
Member

rshriram commented Jul 6, 2020

For posterity, 1.5.7 was a security release and nothing more. The only commits pertained to the CVEs: https://istio.io/latest/news/releases/1.5.x/announcing-1.5.7/

As you can see from the commit log https://github.com/istio/istio/commits/1.5.7, since 1.5.6 (June 15th) the only change has been to update the proxy SHA. If this issue keeps happening, can you try the 1.5.6 proxy image? (I know it has the CVEs mentioned above, so try it in a safer setup.)
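
If it helps, a minimal sketch of pinning a single workload back to the 1.5.6 proxy via the injection annotation rather than rolling back the whole mesh (the Deployment, labels and app container below are placeholders; the annotation key and proxyv2 image are the real ones, and pods need to be recreated for re-injection to pick it up):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingress-nginx                # placeholder workload
  namespace: ingress-nginx
spec:
  selector:
    matchLabels:
      app: ingress-nginx
  template:
    metadata:
      labels:
        app: ingress-nginx
      annotations:
        # pin only this workload's sidecar to the 1.5.6 proxy image
        sidecar.istio.io/proxyImage: docker.io/istio/proxyv2:1.5.6
    spec:
      containers:
      - name: nginx                  # placeholder application container
        image: nginx:1.19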

@Stono
Contributor Author

Stono commented Jul 6, 2020

We've only had one occurrence of this so far (thankfully) which unfortunately means that rolling back won't give a timely, deterministic data point. If we see it pop again, that will be what we try next.

@howardjohn
Member

@lambdai I have never seen a DPE before... any ideas what could cause this?

@lambdai
Contributor

lambdai commented Jul 6, 2020

WRT maxEjectionPercent not working as expected: I recently fixed a bug in upstream Envoy, but it has not been backported. Until that fix, the config is activated only if the stat is enabled. Could you double-check whether the stat is enabled in your Istio? That cluster stat is NOT enabled by default:
envoyproxy/envoy#11629

@lambdai
Contributor

lambdai commented Jul 6, 2020

As explained in #25220, the outlier detector sees a wider occurrence of 5xx than documented. The metric shows the outlier detector is triggered, likely by the gateway errors.

@lambdai
Contributor

lambdai commented Jul 6, 2020

"DPE" is new to me, too. I need to dig

@Stono
Contributor Author

Stono commented Jul 6, 2020

WRT maxEjectionPercent not working as expected: I recently fixed a bug in upstream Envoy, but it has not been backported. Until that fix, the config is activated only if the stat is enabled. Could you double-check whether the stat is enabled in your Istio? That cluster stat is NOT enabled by default.

Hey, I've provided you my config dumps and metrics. Can you confirm whether I'm being hit by that bug please, @lambdai, and if so what the impact is, as it isn't clear to me?

This is my stats config:

sidecar.istio.io/statsInclusionSuffixes: outlier_detection.ejections_active,outlier_detection.ejections_overflow,outlier_detection.ejections_enforced_total,upstream_rq_total,upstream_cx_active,upstream_cx_total
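
For anyone wondering where that lives: it's an annotation on the workload's pod template, roughly like this (a fragment only, with the rest of the Deployment omitted):

spec:
  template:
    metadata:
      annotations:
        # extend the sidecar's default stat inclusion list
        sidecar.istio.io/statsInclusionSuffixes: outlier_detection.ejections_active,outlier_detection.ejections_overflow,outlier_detection.ejections_enforced_total,upstream_rq_total,upstream_cx_active,upstream_cx_total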

@lambdai
Contributor

lambdai commented Jul 6, 2020

GHSA-8hf8-8gvw-ggvx is one of the vulnerability fixes in 1.5.7.
I am not fully sure your case falls into this bucket, but Envoy is now more aggressive in resetting streams, and the destination Envoy reports DPE (due to flood detection). Meanwhile, the source Envoy would interpret this as a 5xx (upstream failure).

@lambdai
Contributor

lambdai commented Jul 6, 2020

istio/envoy@ea2d62e#diff-0cde9df72a69112b1d9af9cc414a107bR315

If the timeout is hit in this case, the :ref:`tx_flush_timeout <config_http_conn_man_stats_per_codec>` counter will be incremented.

Any chance this counter is captured by Prometheus?
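
A quick way to check for that counter directly on a sidecar, independent of what Prometheus scrapes (pod name is a placeholder; 15090 is the sidecar's Prometheus stats port, and this assumes curl is present in the proxy image):

❯ kubectl -n ingress-nginx exec <nginx-pod> -c istio-proxy -- curl -s localhost:15090/stats/prometheus | grep tx_flush_timeout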

@Stono
Contributor Author

Stono commented Jul 7, 2020

Yes, that metric is there but it's 0:

# TYPE envoy_cluster_http2_tx_flush_timeout counter
envoy_cluster_http2_tx_flush_timeout{cluster_name="xds-grpc"} 0

It's worth noting btw we only really see this under heavy load. Right now (about 50% load) we're hardly getting any outlier ejections active.

Unfortunately I was not able to enable the outlierLog because of another bug! #25269

I'm going to try rolling back to 1.5.6 to see what happens.

@lambdai I still don't understand what that stats bug you fixed does, or whether I'm impacted by it.

@Stono
Contributor Author

Stono commented Jul 7, 2020

Update: same behaviour in 1.5.6, so that removes one variable.

@Stono
Contributor Author

Stono commented Jul 14, 2020

We're still seeing this extremely bad behaviour. It isn't resulting in failures but it is resulting in periods of high latency.

I've hacked the injector configuration to make outlier logs work and we are rolling that out today.

Screenshot 2020-07-14 at 09 43 43

@Stono
Contributor Author

Stono commented Jul 14, 2020

@lambdai you never answered my question about the impact of the upstream bug you fixed, and whether it impacts us.

@Stono
Contributor Author

Stono commented Jul 24, 2020

@lambdai ping... ^^

@lambdai
Contributor

lambdai commented Jul 24, 2020

Sorry, I lost this thread.
The bug is envoyproxy/envoy#11629.
If you have already enabled the ejections_active metric, the fix is not relevant to you.

@Stono
Contributor Author

Stono commented Jul 24, 2020

@lambdai I'm really sorry, but I don't understand the bug.

We have that metric enabled, so are we OK?

The envoy PR says “Risk Level: Mid. User would see documented behavior when the ejection metric is disabled.”

Which would seem to imply the inverse of what you just said (e.g. that we need to turn the metric off in order to get the documented behaviour).

@lambdai
Contributor

lambdai commented Jul 24, 2020

The documented (and expected) behavior is that ejection is set up from the outlier detection config, regardless of whether you enable the metric. The buggy behavior was that you needed both the metric and the outlier detection config to enable the percentage accounting.

So an alternative approach is to enable the metric.

@lambdai
Contributor

lambdai commented Jul 24, 2020

@Stono My understanding is that you enable the metrics there but still too many endpoints are injected?

That's probably another bug, but I wish you could confirm.

@Stono
Contributor Author

Stono commented Jul 25, 2020

When you say “injected” do you mean “ejected”?

If so, yes, we were getting too many ejections, but that has been fixed in https://t.co/YaN2TLfB9q; we are just waiting for the release.

We still haven't gotten to the bottom of this bug and the DPE errors, however.

@Stono
Contributor Author

Stono commented Jul 29, 2020

@howardjohn @mandarjog we've discovered this is because of http2!

What you see is:

ingress-nginx -> consumer-gateway:
consumer-gateway kills the connection and reports 0DPE (Downstream Protocol Error); ingress-nginx reports 503UC (upstream connection terminated).

The last time we had this problem, restarting the workload seemed to sort it. However this time we restarted both consumer-gateway and ingress-nginx, and the errors persisted.

Because it said DPE, I gambled on disabling http2 between these services and, sure enough, the errors went away. You can see the brief spike at the end where I attempted to enable http2 again:

Screenshot 2020-07-29 at 17 49 04

It's worth noting, we use http2 for other high throughput services however we have only ever seen this issue between ingress-nginx -> consumer-gateway.

The only thing in the nginx proxy logs was:

[Envoy (Epoch 0)] [2020-07-29 15:54:29.557][27][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:732] [C3915148][S4900880701147995753] request headers complete (end_stream=true):
[Envoy (Epoch 0)] [2020-07-29 15:54:29.557][27][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1280] [C3915148][S4900880701147995753] request end stream
[Envoy (Epoch 0)] [2020-07-29 15:54:29.557][27][debug][router] [external/envoy/source/common/router/router.cc:474] [C3915148][S4900880701147995753] cluster 'outbound|80||app.consumer-gateway.svc.cluster.local' match for URL '/json/bannerData?pageId=search,listings&channel=cars'
[Envoy (Epoch 0)] [2020-07-29 15:54:29.557][27][debug][router] [external/envoy/source/common/router/router.cc:614] [C3915148][S4900880701147995753] router decoding headers:
[Envoy (Epoch 0)] [2020-07-29 15:54:29.557][27][debug][router] [external/envoy/source/common/router/router.cc:1711] [C3915148][S4900880701147995753] pool ready
[Envoy (Epoch 0)] [2020-07-29 15:54:29.625][27][debug][router] [external/envoy/source/common/router/router.cc:990] [C3915148][S4900880701147995753] upstream reset: reset reason connection termination
[Envoy (Epoch 0)] [2020-07-29 15:54:29.625][27][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1422] [C3915148][S4900880701147995753] Sending local reply with details upstream_reset_before_response_started{connection termination}
[Envoy (Epoch 0)] [2020-07-29 15:54:29.625][27][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1620] [C3915148][S4900880701147995753] encoding headers via codec (end_stream=false):

And the only thing in the consumer-gateway logs was:

[Envoy (Epoch 0)] [2020-07-29 16:00:53.666][30][debug][client] [external/envoy/source/common/http/codec_client.cc:104] [C2240] response complete
[Envoy (Epoch 0)] [2020-07-29 16:00:53.666][30][debug][http2] [external/envoy/source/common/http/http2/codec_impl.cc:774] [C2081] stream closed: 0
[Envoy (Epoch 0)] [2020-07-29 16:00:53.666][30][debug][filter] [src/envoy/http/authn/http_filter.cc:49] Called AuthenticationFilter : onDestroy
[Envoy (Epoch 0)] [2020-07-29 16:00:53.666][30][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:201] [C2240] response complete
[Envoy (Epoch 0)] [2020-07-29 16:00:53.666][30][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:206] [C2240] saw upstream close connection
[Envoy (Epoch 0)] [2020-07-29 16:00:53.666][30][debug][connection] [external/envoy/source/common/network/connection_impl.cc:101] [C2240] closing data_to_write=0 type=1
[Envoy (Epoch 0)] [2020-07-29 16:00:53.666][30][debug][connection] [external/envoy/source/common/network/connection_impl.cc:192] [C2240] closing socket: 1
[Envoy (Epoch 0)] [2020-07-29 16:00:53.666][30][debug][client] [external/envoy/source/common/http/codec_client.cc:91] [C2240] disconnect. resetting 0 pending requests
[Envoy (Epoch 0)] [2020-07-29 16:00:53.666][30][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:136] [C2240] client disconnected, failure reason:
[Envoy (Epoch 0)] [2020-07-29 16:00:53.666][31][debug][client] [external/envoy/source/common/http/codec_client.cc:104] [C2255] response complete
[Envoy (Epoch 0)] [2020-07-29 16:00:53.666][30][debug][connection] [external/envoy/source/common/network/connection_impl.cc:538] [C2240] remote close
[Envoy (Epoch 0)] [2020-07-29 16:00:53.666][31][debug][http2] [external/envoy/source/common/http/http2/codec_impl.cc:774] [C1987] stream closed: 0
[Envoy (Epoch 0)] [2020-07-29 16:00:53.666][31][debug][filter] [src/envoy/http/authn/http_filter.cc:49] Called AuthenticationFilter : onDestroy
[Envoy (Epoch 0)] [2020-07-29 16:00:53.666][31][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:201] [C2255] response complete
[Envoy (Epoch 0)] [2020-07-29 16:00:53.666][31][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:239] [C2255] moving to ready
[Envoy (Epoch 0)] [2020-07-29 16:00:53.666][31][debug][client] [external/envoy/source/common/http/codec_client.cc:104] [C138] response complete

We're on 1.5, so http2 is enabled via an EnvoyFilter:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  annotations:
  name: h2-consumer-gateway
  namespace: istio-system
spec:
  configPatches:
  - applyTo: CLUSTER
    match:
      cluster:
        service: app.consumer-gateway.svc.cluster.local
      context: SIDECAR_OUTBOUND
    patch:
      operation: MERGE
      value:
        http2_protocol_options:
          max_concurrent_streams: 47707
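
For comparison, a sketch of roughly the same "speak HTTP/2 to this upstream" setting expressed in the DestinationRule itself, on Istio versions where the h2UpgradePolicy field is available (not the config in use here, and there is no DestinationRule equivalent of max_concurrent_streams; in practice this would be merged into the existing DestinationRule for the host):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: consumer-gateway-h2        # placeholder name
  namespace: consumer-gateway
spec:
  host: app
  trafficPolicy:
    connectionPool:
      http:
        # upgrade sidecar-to-upstream connections from http1.1 to HTTP/2
        h2UpgradePolicy: UPGRADE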

@Stono changed the title from "Variety of failures manifesting as 503UC and 0DPE" to "0DPE (Downstream Protocol Error) caused by HTTP2 Filter" on Jul 29, 2020
@mandarjog
Contributor

Thanks for isolating it @Stono

@lambdai I expect to see

ENVOY_CONN_LOG(debug, "dispatch error: {}", read_callbacks_->connection(), error);

from conn_manager_impl.cc:279

Is connection debug log not enabled?

@Stono
Contributor Author

Stono commented Jul 29, 2020

We see a few, yes @mandarjog:

[Envoy (Epoch 0)] [2020-07-29 16:00:56.789][30][debug][http2] [external/envoy/source/common/http/http2/codec_impl.cc:774] [C1860] stream closed: 0
[Envoy (Epoch 0)] [2020-07-29 16:00:56.805][30][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:263] [C1860] new stream
[Envoy (Epoch 0)] [2020-07-29 16:00:56.805][30][debug][http2] [external/envoy/source/common/http/http2/codec_impl.cc:681] [C1860] invalid frame: Invalid HTTP header field was received on stream 8795
[Envoy (Epoch 0)] [2020-07-29 16:00:56.805][30][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:279] [C1860] dispatch error: The user callback function failed
[Envoy (Epoch 0)] [2020-07-29 16:00:56.805][30][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1859] [C1860][S2683156078241257520] stream reset
[Envoy (Epoch 0)] [2020-07-29 16:00:56.805][30][debug][connection] [external/envoy/source/common/network/connection_impl.cc:101] [C1860] closing data_to_write=0 type=2
[Envoy (Epoch 0)] [2020-07-29 16:00:56.805][30][debug][connection] [external/envoy/source/common/network/connection_impl_base.cc:30] [C1860] setting delayed close timer with timeout 1000 ms

@mandarjog
Contributor

@PiotrSikora @lambdai

http2/codec_impl.cc:681 - invalid frame: Invalid HTTP header field was received on stream 8795

suggests that an http1 header is being encoded incorrectly by the client-side Envoy codec.

@lambdai
Contributor

lambdai commented Jul 29, 2020

To sort it out:

  • ingress-nginx app -> ingress-nginx sidecar: unknown HTTP protocol; nginx decides.
  • ingress-nginx sidecar -> consumer-gateway sidecar: MUST be http2.
  • consumer-gateway sidecar -> consumer-gateway app: unknown HTTP protocol; the port name of the k8s Service selecting the consumer-gateway pod decides. It should be http1 (see the comment below, and the sketch after this list).
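
A minimal sketch of what decides that last hop, assuming manual protocol selection by port name (the selector and ports below are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: app
  namespace: consumer-gateway
spec:
  selector:
    app: consumer-gateway        # placeholder pod label
  ports:
  - name: http                   # the "http" name prefix tells Istio to treat this port as HTTP/1.1
    port: 80
    targetPort: 8080             # placeholder container port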

@mandarjog
Contributor

@PiotrSikora so now we know the response headers are the problem and not the request headers from the gw Envoy to the consumer Envoy?

@PiotrSikora
Contributor

@PiotrSikora so now we know the response headers are the problem and not the request headers from the gw Envoy to the consumer Envoy?

You're right, it even says Downstream Protocol Error... That's what I get for responding before I'm fully awake.

In any case, running a debug build would tell us exactly what's triggering this error... Also, it's possible that %RESPONSE_CODE_DETAILS% has more details, but I think it might just return http2.invalid.header.field.

@lambdai
Contributor

lambdai commented Jul 30, 2020

@Stono You are right. I cannot reproduce it. It's great you can confirm there is a BROKEN PIPE at the consumer-gateway app! Now I am almost sure it's between the consumer-gateway sidecar and the consumer-gateway app.
Since the consumer-gateway app is trying to write, it's likely the HTTP response (including headers and body) is not passing Envoy's validation.

@mandarjog
Contributor

@PiotrSikora the only way to get a DPE at the inbound side of the sidecar is that the outbound from the ingress-gw encoded something incorrectly. Downstream of the ingress-gw is http1.1.
Previously we have seen UTF-8, or some magic stuff, in the trace headers.

@Stono
Contributor Author

Stono commented Jul 30, 2020

No... @lambdai that doesn't make any sense. The sidecar is killing the connection way before the consumer gateway sends a response, so it can't be anything to do with the response headers.

The broken pipe in the app is a symptom of the sidecar dropping the connection, which is in turn a symptom of the DPE.

@Stono
Contributor Author

Stono commented Jul 30, 2020

No, unfortunately I cannot run a debug build for this. It's manifested twice in production in three weeks, on a service doing 1000+ ops/second.

It's simply not feasible to leave debug logs running in production for that period of time.

Happy to run a build with more logging specifically around the DPE code, however.

@lambdai
Contributor

lambdai commented Jul 30, 2020

Just want to clarify that sometimes DPE is wider than "downstream". IIRC a duplicated "chunked" header is also reported as a DPE.

@lambdai
Contributor

lambdai commented Jul 30, 2020

No... @lambdai that doesn't make any sense. The sidecar is killing the connection way before the consumer gateway sends a response, so it can't be anything to do with the response headers.

The broken pipe in the app is a symptom of the sidecar dropping the connection, which is in turn a symptom of the DPE.

I agree it's Envoy closing the connection. What I'm not fully sure about is the latter part: that it happens before the consumer gateway sends a response.
We know the BROKEN PIPE is the consumer-gateway app trying to write; the write could be headers or body.

I suspect Envoy received the headers but thinks they are corrupted. I've seen this in Envoy before, but I don't know whether that's your case.

@Stono
Contributor Author

Stono commented Jul 30, 2020

No, I promise you 100% that consumer-gateway has not written any headers.

The DPE happens way, way before the app attempts to write the response; that is clearly visible in the waterfall trace I shared above.

@Stono
Contributor Author

Stono commented Jul 30, 2020

Also, even if there were invalid headers, that doesn't explain why we only have the problem with http2 enabled between the sidecars.

In both cases, http1 and http2, the connection between the consumer-gateway sidecar and the app is http1.1.

@lambdai
Contributor

lambdai commented Jul 30, 2020

No, I promise you 100% that consumer-gateway has not written any headers.

The DPE happens way, way before the app attempts to write the response.

I trust you, but I am wondering how the gateway app could attempt to write anything to the gateway sidecar. It's http1, a strict request-response model.

@Stono
Contributor Author

Stono commented Jul 30, 2020

@lambdai perhaps we're getting lost in translation here...

I'm saying the consumer-gateway app is writing the response back, but that fails with a broken pipe exception because it's trying to write a response on a connection that has been closed by the source. This is how Java manifests a downstream (in this case the consumer-gateway sidecar) disconnection. Take this trace, where consumer-gateway cannot return anything until sauron-web has responded (as consumer-gateway is a simple Java Zuul proxy):

Screenshot 2020-07-30 at 19 35 43

(note: there's actually a lot more under sauron-web, I've just trimmed it in the screenshot)

This is what I believe is happening:
Screenshot 2020-07-30 at 19 37 21

The issue can't be in the request headers, as the request makes it to consumer-gateway (it starts making its downstream requests). It's something happening between the request reaching the app and the app returning response data.

@istio-policy-bot istio-policy-bot added the lifecycle/stale Indicates a PR or issue hasn't been manipulated by an Istio team member for a while label Oct 29, 2020
@howardjohn howardjohn added lifecycle/staleproof Indicates a PR or issue has been deemed to be immune from becoming stale and/or automatically closed and removed lifecycle/stale Indicates a PR or issue hasn't been manipulated by an Istio team member for a while labels Oct 29, 2020
@Stono
Contributor Author

Stono commented Dec 7, 2020

@howardjohn @mandarjog @lambdai we recently tried turning on http2 again on 1.7 (this issue was raised on 1.5), and can confirm that after a few days (3) we started to see the same failures, so we backed it out. I don't really have any more information other than what has been shared on this thread.

The extremely difficult thing here is that it takes days to start manifesting, and when it does, it's only on a service that receives very high throughput - so capturing istio-proxy logs can be difficult.

@Stono
Contributor Author

Stono commented Dec 7, 2020

In fact, one other piece of information I was able to glean from our edge logs was that the issues affected all of the different services routing through consumer-gateway.

So, as a refresher, consumer-gateway is a Zuul proxy in front of a variety (20+) of downstream apps. During the latest incident, as you can see here, all of the different paths (and consequently all of the different services) started having errors. This points more towards an Istio-level issue than an application-level issue like @lambdai pointed out (e.g. duplicated headers, etc.), as the impact was simply too broad (and all consumer-gateway does is blind reverse-proxying, and it was not modified during this time).

Screenshot 2020-12-07 at 09 43 26

@Stono
Contributor Author

Stono commented Feb 28, 2021

As part of #31136 I've turned http2 on for one of our services again. This time we're on 1.8.3.

@Stono
Contributor Author

Stono commented Oct 29, 2021

@howardjohn as you saw in Slack, we decided to make http2 the default again.
After an indeterminate amount of time, we started to observe 503UC.

Screenshot 2021-10-29 at 06 58 07

You can see from the trace that it's effectively the first "hop" that fails:

Screenshot 2021-10-29 at 07 01 00

With a DPE:
Screenshot 2021-10-29 at 07 01 27

So whatever this issue is, it's still present

@Stono
Contributor Author

Stono commented Oct 29, 2021

Interestingly, it's the same service as last year (ingress-nginx -> consumer-gateway)

@mandarjog
Contributor

What are the response flag details?
Is the following the correct flow?
Nginx-ingress + envoy -> (h2) -> envoy sidecar + zuul (consumer gw)
This means the sidecar is encoding h2 which it cannot decode on the other side…

@Stono
Contributor Author

Stono commented Oct 29, 2021

What are the response flag details?

What do you mean, sorry? Is there some other info I can get for you?

Is the following the correct flow?

Yes, that's the correct flow.

This means the sidecar is encoding h2 which it cannot decode on the other side…

Yeah, but why would that happen? And also, why would it just degrade without any releases or code changes, and be immediately fixed by toggling http2 on and off? This happened at 6am one morning; http2 had been enabled in the consumer-gateway config quite some time before that.

@mandarjog
Contributor

https://istio.io/latest/docs/tasks/observability/logs/access-log/#default-access-log-format - look for %RESPONSE_CODE_DETAILS%.

It is possible that this is data-dependent, so some specific headers may bring out this behaviour in h2 processing.

@lambdai is there a way to dump full request contents when a DPE is detected? We may need a special istio-proxy build.

@Stono
Contributor Author

Stono commented Oct 29, 2021

So we don't have the access log enabled by default, as it's noisy and gives us little over what Jaeger does.
And restarting the workload will make the problem temporarily go away (so we can't add annotations).
Is there any way for me to enable the access log at runtime?
Or perhaps an EnvoyFilter that would log out the request context when the response is 0DPE?
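
One possible shape for that last idea: an EnvoyFilter that merges a file access log (including %RESPONSE_CODE_DETAILS%) into the inbound HTTP connection manager for just this workload. Names and labels below are placeholders, and the exact filter and type names vary by Istio/Envoy version, so treat this as an untested sketch. Separately, on recent istioctl versions, proxy log levels (not the access log) can be raised at runtime with something like istioctl proxy-config log <pod> --level http2:debug.

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: consumer-gateway-access-log    # placeholder name
  namespace: consumer-gateway
spec:
  workloadSelector:
    labels:
      app: consumer-gateway            # placeholder label; scopes the patch to this workload
  configPatches:
  - applyTo: NETWORK_FILTER
    match:
      context: SIDECAR_INBOUND
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
    patch:
      operation: MERGE
      value:
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          access_log:
          - name: envoy.access_loggers.file
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
              path: /dev/stdout
              log_format:
                # surface the detailed reason (e.g. http2.invalid.header.field) next to the response flags
                text_format: "%RESPONSE_CODE% %RESPONSE_FLAGS% %RESPONSE_CODE_DETAILS% %REQ(:AUTHORITY)% %REQ(:PATH)%\n"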

@swarnar87

swarnar87 commented Apr 30, 2022

We have the exact same issue. Any way to log ingress request headers? This is the error in the logs:
[debug][http] [source/common/http/conn_manager_impl.cc:322] [C18109] dispatch error: http/1.1 protocol error: HPE_INVALID_HEADER_TOKEN
