Proxy freeze on outbound request with long GET request #2859

Closed
tanuck opened this issue May 29, 2019 · 11 comments

@tanuck
Contributor

tanuck commented May 29, 2019

Bug Report

What is the issue?

When making an outbound GET request (other HTTP methods may also be affected) with a long URL, the request blocks and the proxy exhibits an unbounded memory leak and CPU spike until it freezes permanently. This usually results in pod restarts because health checks cannot reach the application container.

How can it be reproduced?

We originally reproduced this by rendering a Grafana chart with a long Prometheus query (>6000 characters). The proxy then froze when Traefik forwarded the request to the Grafana pod.

You can also exec into any meshed pod and perform an outbound HTTP request to another pod with an arbitrarily long URL (~6 KB), as sketched below.
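
For illustration, a minimal sketch of that second repro (the pod, container, and service names here are hypothetical, and it assumes curl is available in the application container):

# Hypothetical names: replace <meshed-pod> and "app" with a meshed pod and its
# application container, and "nginx.default" with a service reachable from it.
$ LONG=$(printf 'a%.0s' $(seq 1 6000))
$ kubectl exec -it <meshed-pod> -c app -- \
    curl -sv "http://nginx.default.svc.cluster.local/?q=${LONG}"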

Logs, error output, etc

No logs with default log levels

linkerd check output

$ linkerd check
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist

linkerd-existence
-----------------
√ control plane components ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ no invalid service profiles

linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
    is running version 19.5.2 but the latest edge version is 19.5.3
    see https://linkerd.io/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 19.5.2 but the latest edge version is 19.5.3
    see https://linkerd.io/checks/#l5d-version-control for hints
√ control plane and cli versions match

Status check results are √

Environment

  • Kubernetes Version: v1.13.5-gke.10
  • Cluster Environment: GKE
  • Host OS: cos
  • Linkerd version: edge-19.5.2/edge-19.5.3

Possible solution

Additional context

If there is an acceptable maximum query/URL length somewhere in the Rust stack, the proxy should at least report that gracefully back to the client rather than freezing.

@olix0r
Member

olix0r commented May 29, 2019

@tanuck thanks for the report! We're actively looking into this class of issue, but the repro we've been tracking sounds a bit different.

We originally reproduced this by rendering a grafana chart with a long prometheus query (>6000 characters). This then caused the proxy freeze when traefik was forwarding the request to the grafana pod.

This is surprising! Can you share specific repro steps? Is there a curl command that can trigger this behavior, for instance?

@olix0r olix0r self-assigned this May 29, 2019
@tanuck
Contributor Author

tanuck commented May 29, 2019

@olix0r yeah, there seem to be a few issues revolving around CPU/memory spikes, but I couldn't find anything with symptoms similar to what we were seeing.

There's an nginx YAML manifest and an example curl command that reproduce this in the following gist:
https://gist.github.com/tanuck/43282a75d2915bd4e1ebff63cf1c8611

FYI I've just verified this in edge-19.5.3 as well.

@olix0r
Member

olix0r commented May 29, 2019

Can you try running linkerd upgrade --disable-h2-upgrade? This will prevent the proxy from multiplexing HTTP/1 streams over HTTP/2 connections.
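
For reference, a rough sketch of the usual workflow (assuming the regenerated manifests are applied with kubectl, and that already-injected workloads need a restart to pick up the change):

# Regenerate the control-plane manifests with the HTTP/1 -> HTTP/2 upgrade
# disabled, and apply them to the cluster.
$ linkerd upgrade --disable-h2-upgrade | kubectl apply -f -

# Already-injected workloads keep their previous proxy configuration until
# their pods are recreated, e.g. by restarting the affected deployments.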

@tanuck
Contributor Author

tanuck commented May 29, 2019

Will do. My hunch from our investigation is that this will resolve it; I'll give it a go now.

@tanuck
Contributor Author

tanuck commented May 29, 2019

@olix0r yup that fixes it. Is there a request header that can be set to disable the h2 upgrade?

@olix0r
Member

olix0r commented May 29, 2019

For now, I'd just leave this in this state (and we may want to change the default on master until we identify the underlying issue). I'd hope the header wouldn't be generally useful outside of the rare case where the h2 codepath has a bug.

@hawkw
Contributor

hawkw commented May 29, 2019

I've managed to reproduce this using the provided configuration and done some testing. This definitely seems like an h2 bug, possibly related to HPACK: when sending the long request, the source and destination proxies both enter a state where they loop forever.

The destination proxy logs these messages repeatedly:

default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021288s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::proto::streams::prioritize poll_complete
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021291s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::proto::streams::prioritize schedule_pending_open
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021294s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::proto::streams::prioritize pop_frame
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021298s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::codec::framed_write flush
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021301s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::codec::framed_write flushing buffer
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021303s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::proto::streams::prioritize try reclaim frame
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021377s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::proto::settings send_pending_ack; pending=None
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021384s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::codec::framed_read poll
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021406s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::codec::framed_read poll; bytes=9B
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021409s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::codec::framed_read decoding frame from 9B
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021422s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::codec::framed_read     -> kind=Continuation
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021426s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::hpack::decoder decode
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021430s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::codec::framed_read poll
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021436s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::proto::streams::prioritize try reclaim frame
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021439s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::proto::streams::prioritize poll_complete
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021442s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::proto::streams::prioritize schedule_pending_open
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021469s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::proto::streams::prioritize pop_frame
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021473s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::codec::framed_write flush
default nginx-5bbc84f8db-7l2bd linkerd-proxy TRCE [  3119.021475s] proxy={server=in listen=0.0.0.0:4143 remote=10.52.0.205:52902} h2::codec::framed_write flushing buffer

While the source proxy just logs:

linkerd linkerd-grafana-c59b86cf4-tkm77 linkerd-proxy TRCE [   645.757125s] proxy={client=out dst=10.52.0.162:80 proto=Http2} h2::codec::framed_write   -> not a queued data frame
linkerd linkerd-grafana-c59b86cf4-tkm77 linkerd-proxy TRCE [   645.757164s] proxy={client=out dst=10.52.0.162:80 proto=Http2} h2::codec::framed_write   -> not a queued data frame
linkerd linkerd-grafana-c59b86cf4-tkm77 linkerd-proxy TRCE [   645.757210s] proxy={client=out dst=10.52.0.162:80 proto=Http2} h2::codec::framed_write   -> not a queued data frame
linkerd linkerd-grafana-c59b86cf4-tkm77 linkerd-proxy TRCE [   645.757263s] proxy={client=out dst=10.52.0.162:80 proto=Http2} h2::codec::framed_write   -> not a queued data frame
linkerd linkerd-grafana-c59b86cf4-tkm77 linkerd-proxy TRCE [   645.757309s] proxy={client=out dst=10.52.0.162:80 proto=Http2} h2::codec::framed_write   -> not a queued data frame
linkerd linkerd-grafana-c59b86cf4-tkm77 linkerd-proxy TRCE [   645.757352s] proxy={client=out dst=10.52.0.162:80 proto=Http2} h2::codec::framed_write   -> not a queued data frame
linkerd linkerd-grafana-c59b86cf4-tkm77 linkerd-proxy TRCE [   645.757901s] proxy={client=out dst=10.52.0.162:80 proto=Http2} h2::codec::framed_write   -> not a queued data frame
linkerd linkerd-grafana-c59b86cf4-tkm77 linkerd-proxy TRCE [   645.758137s] proxy={client=out dst=10.52.0.162:80 proto=Http2} h2::codec::framed_write   -> not a queued data frame
linkerd linkerd-grafana-c59b86cf4-tkm77 linkerd-proxy TRCE [   645.758235s] proxy={client=out dst=10.52.0.162:80 proto=Http2} h2::codec::framed_write   -> not a queued data frame

Here's a gist with a longer section of the logs, starting from when the outbound client connection is established, and a full log dump starting from when the two proxies were initially spun up.
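
(As a reference sketch for anyone collecting similar traces: assuming trace logging has been enabled on the proxies, e.g. via the proxy's LINKERD2_PROXY_LOG environment variable, the output above can be tailed from the sidecar containers directly.)

# Destination proxy (the nginx pod from the logs above):
$ kubectl -n default logs -f nginx-5bbc84f8db-7l2bd -c linkerd-proxy

# Source proxy (the Grafana pod from the logs above):
$ kubectl -n linkerd logs -f linkerd-grafana-c59b86cf4-tkm77 -c linkerd-proxy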

@admc admc added the priority/P0 Release Blocker label May 29, 2019
@hawkw
Contributor

hawkw commented May 29, 2019

There's an upstream PR hyperium/h2#360 in progress that should fix this issue. Thanks @tanuck for the report and repro steps!

@hawkw
Contributor

hawkw commented May 31, 2019

I believe this was fixed in stable-2.3.1 by linkerd/linkerd2-proxy#261!

@olix0r
Member

olix0r commented Jun 3, 2019

@tanuck We believe we've fixed this in stable-2.3.1 and edge-19.5.4. Please reopen this issue if you still see problems!

@tanuck
Contributor Author

tanuck commented Jun 5, 2019

@olix0r yeah, confirmed the fix in edge-19.5.4. Thanks 👍

@tanuck tanuck closed this as completed Jun 5, 2019
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 17, 2021