Proxy freeze on outbound request with long GET request #2859

Closed
tanuck opened this issue May 29, 2019 · 11 comments

@tanuck
Contributor

tanuck commented May 29, 2019

Bug Report

What is the issue?

When making an outbound GET request (other HTTP methods may also be affected) with a long URL, the request blocks and the proxy exhibits an unbounded memory leak and CPU spike until it freezes permanently. This usually results in pod restarts because health checks cannot reach the application container.

How can it be reproduced?

We originally reproduced this by rendering a Grafana chart with a long Prometheus query (>6000 characters). The proxy then froze when Traefik forwarded the request to the Grafana pod.

You can also exec into any meshed pod and perform an outbound HTTP request to another pod with an arbitrarily long URL (~6 KB), as sketched below.
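
For illustration, a minimal sketch of that second repro (the pod, container, and service names here are hypothetical, and it assumes curl is available in the application container):

# Hypothetical names: replace <meshed-pod> and "app" with a meshed pod and its
# application container, and "nginx.default" with a service reachable from it.
$ LONG=$(printf 'a%.0s' $(seq 1 6000))
$ kubectl exec -it <meshed-pod> -c app -- \
    curl -sv "http://nginx.default.svc.cluster.local/?q=${LONG}"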

Logs, error output, etc

No logs with default log levels

linkerd check output

$ linkerd check
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist

linkerd-existence
-----------------
√ control plane components ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ no invalid service profiles

linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
    is running version 19.5.2 but the latest edge version is 19.5.3
    see https://linkerd.io/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 19.5.2 but the latest edge version is 19.5.3
    see https://linkerd.io/checks/#l5d-version-control for hints
√ control plane and cli versions match

Status check results are √

Environment

  • Kubernetes Version: v1.13.5-gke.10
  • Cluster Environment: GKE
  • Host OS: cos
  • Linkerd version: edge-19.5.2/edge-19.5.3

Possible solution

Additional context

If there is an acceptable maximum query/URL length somewhere in the Rust stack, the proxy should at least report that gracefully back to the client rather than freezing.

@olix0r
Member

olix0r commented May 29, 2019

@tanuck thanks for the report! We're actively looking into this class of issue, but the repro we've been tracking sounds a bit different.

We originally reproduced this by rendering a grafana chart with a long prometheus query (>6000 characters). This then caused the proxy freeze when traefik was forwarding the request to the grafana pod.

This is surprising! Can you share specific repro steps? Is there a curl command that can trigger this behavior, for instance?

@olix0r olix0r self-assigned this May 29, 2019
@tanuck
Contributor Author

tanuck commented May 29, 2019

@olix0r yeah, there seem to be a few issues revolving around CPU/memory spikes, but I couldn't find anything with symptoms similar to what we were seeing.

There's an nginx YAML manifest and an example curl command that reproduce this in the following gist:
https://gist.github.com/tanuck/43282a75d2915bd4e1ebff63cf1c8611

FYI I've just verified this in edge-19.5.3 as well.

@olix0r
Member

olix0r commented May 29, 2019

Can you try running linkerd upgrade --disable-h2-upgrade? This will prevent the proxy from multiplexing HTTP/1 streams over HTTP/2 connections.
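
For reference, a rough sketch of the usual workflow (assuming the regenerated manifests are applied with kubectl, and that already-injected workloads need a restart to pick up the change):

# Regenerate the control-plane manifests with the HTTP/1 -> HTTP/2 upgrade
# disabled, and apply them to the cluster.
$ linkerd upgrade --disable-h2-upgrade | kubectl apply -f -

# Already-injected workloads keep their previous proxy configuration until
# their pods are recreated, e.g. by restarting the affected deployments.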

@tanuck
Contributor Author

tanuck commented May 29, 2019

Will do. My hunch from our investigation is that this will resolve it; I'll give it a go now.

@tanuck
Contributor Author

tanuck commented May 29, 2019

@olix0r yup that fixes it. Is there a request header that can be set to disable the h2 upgrade?

@olix0r
Member

olix0r commented May 29, 2019

For now, I'd just leave this in this state (and we may want to change the default on master until we identify the underlying issue). I'd hope the header wouldn't be generally useful outside of the rare case where the h2 codepath has a bug.

@hawkw
Contributor

hawkw commented May 29, 2019

I've managed to reproduce this using the provided configuration and done some testing. This definitely seems like an h2 bug, possibly related to HPACK: when sending the long request, the source and destination proxies both enter a state where they loop forever.

The destination proxy logs these messages repeatedly:

default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021288s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::proto::streams::prioritize poll_complete
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021291s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::proto::streams::prioritize schedule_pending_open
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021294s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::proto::streams::prioritize pop_frame
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021298s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::codec::framed_write flush
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021301s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::codec::framed_write flushing buffer
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021303s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::proto::streams::prioritize try reclaim frame
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021377s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::proto::settings send_pending_ack; pending=None
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021384s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::codec::framed_read poll
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021406s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::codec::framed_read poll; bytes=9B
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021409s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::codec::framed_read decoding frame from 9B
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021422s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::codec::framed_read     -> kind=Continuation
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021426s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::hpack::decoder decode
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021430s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::codec::framed_read poll
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021436s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::proto::streams::prioritize try reclaim frame
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021439s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::proto::streams::prioritize poll_complete
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021442s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::proto::streams::prioritize schedule_pending_open
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021469s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::proto::streams::prioritize pop_frame
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021473s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::codec::framed_write flush
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021475s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::codec::framed_write flushing buffer

While the source proxy just logs:

linkerd linkerd-grafana-c59b86cf4-tkm77 linkerd-proxy TRCE [   645.757125s] proxy={client=out dst=10.52.0.162:80 proto=Http2} h2::codec::framed_write   -> not a queued data frame
linkerd linkerd-grafana-c59b86cf4-tkm77 linkerd-proxy TRCE [   645.757164s] proxy={client=out dst=10.52.0.162:80 proto=Http2} h2::codec::framed_write   -> not a queued data frame
linkerd linkerd-grafana-c59b86cf4-tkm77 linkerd-proxy TRCE [   645.757210s] proxy={client=out dst=10.52.0.162:80 proto=Http2} h2::codec::framed_write   -> not a queued data frame
linkerd linkerd-grafana-c59b86cf4-tkm77 linkerd-proxy TRCE [   645.757263s] proxy={client=out dst=10.52.0.162:80 proto=Http2} h2::codec::framed_write   -> not a queued data frame
linkerd linkerd-grafana-c59b86cf4-tkm77 linkerd-proxy TRCE [   645.757309s] proxy={client=out dst=10.52.0.162:80 proto=Http2} h2::codec::framed_write   -> not a queued data frame
linkerd linkerd-grafana-c59b86cf4-tkm77 linkerd-proxy TRCE [   645.757352s] proxy={client=out dst=10.52.0.162:80 proto=Http2} h2::codec::framed_write   -> not a queued data frame
linkerd linkerd-grafana-c59b86cf4-tkm77 linkerd-proxy TRCE [   645.757901s] proxy={client=out dst=10.52.0.162:80 proto=Http2} h2::codec::framed_write   -> not a queued data frame
linkerd linkerd-grafana-c59b86cf4-tkm77 linkerd-proxy TRCE [   645.758137s] proxy={client=out dst=10.52.0.162:80 proto=Http2} h2::codec::framed_write   -> not a queued data frame
linkerd linkerd-grafana-c59b86cf4-tkm77 linkerd-proxy TRCE [   645.758235s] proxy={client=out dst=10.52.0.162:80 proto=Http2} h2::codec::framed_write   -> not a queued data frame

Here's a gist with a longer section of the logs, starting from when the outbound client connection is established, and a full log dump starting from when the two proxies were initially spun up.
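
(As a reference sketch for anyone collecting similar traces: assuming trace logging has been enabled on the proxies, e.g. via the proxy's LINKERD2_PROXY_LOG environment variable, the output above can be tailed from the sidecar containers directly.)

# Destination proxy (the nginx pod from the logs above):
$ kubectl -n default logs -f nginx-5bbc84f8db-7l2bd -c linkerd-proxy

# Source proxy (the Grafana pod from the logs above):
$ kubectl -n linkerd logs -f linkerd-grafana-c59b86cf4-tkm77 -c linkerd-proxy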

@admc admc added the priority/P0 Release Blocker label May 29, 2019
@hawkw
Contributor

hawkw commented May 29, 2019

There's an upstream PR hyperium/h2#360 in progress that should fix this issue. Thanks @tanuck for the report and repro steps!

@hawkw
Contributor

hawkw commented May 31, 2019

I believe this was fixed in stable-2.3.1 by linkerd/linkerd2-proxy#261!

@olix0r
Member

olix0r commented Jun 3, 2019

@tanuck We believe we've fixed this in stable-2.3.1 and edge-19.5.4. Please reopen this issue if you still see problems!

@tanuck
Contributor Author

tanuck commented Jun 5, 2019

@olix0r yeah, confirmed the fix in edge-19.5.4. Thanks 👍

@tanuck tanuck closed this as completed Jun 5, 2019
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 17, 2021