Proxy does not honor request cancellation #986
I suspect that this is related to #952. Everything goes fine in the bb-terminus proxy until a connect error is encountered:
This log message prints repeatedly, and it appears that the endpoint is never properly re-bound.
Ah hah! It appears to me that the proxy is doing the right thing: there is nothing accepting connections on port 9090:
Checking the container's logs, it appears that nothing is logged after the container was supposed to have restarted:
Though, it does appear the container is thought to be running:
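For context on what "nothing accepting connections" looks like from the proxy's side, here is a minimal, purely illustrative probe (not part of the bb/Conduit test environment) that attempts a TCP connect to the port mentioned above and reports the same connection-refused error the proxy's connect attempts would hit:

```rust
use std::io::ErrorKind;
use std::net::TcpStream;

fn main() {
    // Port 9090 is the terminus port referenced above; everything else here is
    // hypothetical and only illustrates the failure mode.
    match TcpStream::connect("127.0.0.1:9090") {
        Ok(_) => println!("something is listening on 9090"),
        Err(e) if e.kind() == ErrorKind::ConnectionRefused => {
            // This is what the proxy observes: the container's port is not bound.
            println!("connection refused: nothing accepting connections on 9090")
        }
        Err(e) => println!("connect failed: {}", e),
    }
}
```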
This configuration appears to reproduce this buggy behavior without the proxy being in the loop: we run slow-cooker, bb-p2p, and bb-terminus all in a single pod. slow-cooker's throughput drops similarly. I'm going to close this Conduit issue, since this doesn't seem worth pursuing further here. Please reopen this if we find something that indicates a Conduit issue.
Adding services may have been a red herring; a successful run with either gist should result in a
As siggy mentioned, the nature of the repro has made it hard to nail this down exactly. When a container restarts, the proxy attempts to rebind the client to that endpoint. This appears to work more-or-less as intended: the terminus container completes its work and exits and, when Kubernetes restarts the process, the proxy reconnects properly. However, sometimes the proxy continues to attempt to reconnect to the local container, but it continually gets connection refused errors. I suspected that, in this situation, the terminus container is unable to start because the proxy is too aggressively consuming CPU resources (attempting to reconnect). To validate this theory, I created a branch that adds a delay into the reconnect logic so that the terminus container has a better shot of getting on its feet (a rough sketch of this retry-with-delay idea appears after this comment). This did not resolve the issue. The proxy can still get into situations where it continually receives an error:
Looking at the terminus container, its status is reported as:
And in the logs I see:
This seems to indicate that the terminus application is still running but no longer accepting connections! The process started at 14:46:55 and logged that it was shutting down at 14:47:15, but it hasn't actually shut down! Here's the configuration I'm using: https://gist.github.com/olix0r/c87538ca14ca93bb5f940780079edcfa
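To make the retry-with-delay experiment above concrete, here is a rough sketch of retrying a local connection with a capped exponential backoff. The function, address, and delays are hypothetical; the actual proxy's bind/reconnect code is structured differently. This only illustrates the idea of giving a restarting container time to bind its port before the next attempt:

```rust
use std::time::Duration;
use tokio::net::TcpStream;
use tokio::time::sleep;

// Hypothetical sketch: retry a connection with a capped exponential backoff
// so a restarting container has a chance to bind its port.
async fn connect_with_backoff(addr: &str) -> TcpStream {
    let mut delay = Duration::from_millis(100);
    loop {
        match TcpStream::connect(addr).await {
            Ok(stream) => return stream,
            Err(e) => {
                eprintln!("connect to {} failed ({}); retrying in {:?}", addr, e, delay);
                sleep(delay).await;
                delay = (delay * 2).min(Duration::from_secs(5));
            }
        }
    }
}

#[tokio::main]
async fn main() {
    // e.g. the terminus port from the comments above.
    let _stream = connect_with_backoff("127.0.0.1:9090").await;
    println!("connected");
}
```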
I cannot reproduce this with HTTP/1, but I can with HTTP/2. It appears that there is a pending request that prevents the server from terminating:
The server sees only one pending request:
The middle tier sees 27 (and growing) pending requests:
Removing the proxy from either the middle or server tier seems to eliminate the behavior; when both the middle and server tiers have the proxy, the bug is exhibited. Fairly reliably, this occurs on the 3rd iteration of the server (i.e., after 2 restarts). Shortly before the test breaks, we see the terminus server send GoAway frames to the proxy:
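One way to see why a single stuck pending request keeps the process alive: graceful shutdown typically waits for an in-flight counter (or the set of open streams) to drain to zero, so a request whose completion or cancellation is never observed blocks termination forever. Below is a hypothetical, self-contained sketch of that drain logic; it is not the bb or tower-h2 code, just the general shape of the problem:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Notify;

// Hypothetical in-flight tracking: shutdown waits until the count drains to zero.
struct InFlight {
    count: AtomicUsize,
    drained: Notify,
}

impl InFlight {
    fn start(&self) {
        self.count.fetch_add(1, Ordering::SeqCst);
    }

    fn finish(&self) {
        if self.count.fetch_sub(1, Ordering::SeqCst) == 1 {
            self.drained.notify_waiters();
        }
    }

    async fn wait_for_drain(&self) {
        loop {
            // Register interest before checking the count to avoid missing a wakeup.
            let notified = self.drained.notified();
            if self.count.load(Ordering::SeqCst) == 0 {
                return;
            }
            notified.await;
        }
    }
}

#[tokio::main]
async fn main() {
    let inflight = Arc::new(InFlight {
        count: AtomicUsize::new(0),
        drained: Notify::new(),
    });

    // A normally completed request increments and then decrements the count.
    inflight.start();
    inflight.finish();

    // Simulate a request that starts but whose completion (or cancellation) is
    // never observed -- e.g. the client reset the stream and the server missed it.
    inflight.start();

    // "Graceful shutdown" never finishes; bound it with a timeout just to show that.
    let stuck = tokio::time::timeout(Duration::from_secs(1), inflight.wait_for_drain())
        .await
        .is_err();
    println!("drain blocked by a pending request: {}", stuck);
}
```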
It looks like request

Request 1: `mid` sends a request to `srv`, which connects to its application and sends the request.
Request 2: `mid` sends another request to `srv`. As `mid` routes the request, it processes service discovery updates reflecting the fact
Again, `srv` connects to its application and sends the request. The application responds
Request 3: `mid` sends another request to `srv`. This time, a destination removal event hasn't been
Request 4: `mid` sends another request to `srv`. This time, the only endpoint is removed from service
But the endpoint is restored before the bind timeout is reached. As the endpoint is

Here's where things get weird: somehow two requests are issued to the application. Perhaps Request 3, which was canceled, is somehow queued and partially sent?
After some discussion with @seanmonstar, we think that tower-h2's server likely needs to be updated to handle cancellation. tower-rs/tower-h2#29
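In rough terms, "handling cancellation" means the pending response work must be raced against a reset signal rather than polled to completion unconditionally. Here is a hypothetical sketch of that shape; it is not tower-h2's actual API, and the real fix wires this up to the h2 stream's reset notification rather than a oneshot channel:

```rust
use std::future::Future;
use std::time::Duration;
use tokio::sync::oneshot;

// Hypothetical: race the response-producing future against a cancellation
// signal (e.g. fired when the client sends RST_STREAM). If cancellation wins,
// the pending work is dropped and no longer counts as an in-flight request.
async fn serve_one(
    respond: impl Future<Output = String>,
    cancel: oneshot::Receiver<()>,
) -> Option<String> {
    tokio::select! {
        response = respond => Some(response),
        _ = cancel => None,
    }
}

#[tokio::main]
async fn main() {
    let (cancel_tx, cancel_rx) = oneshot::channel();

    // Simulate the client resetting the stream before the response is ready.
    cancel_tx.send(()).unwrap();

    let result = serve_one(
        async {
            // Stands in for a slow upstream response that would otherwise hold
            // the stream open indefinitely.
            tokio::time::sleep(Duration::from_secs(10)).await;
            "response".to_string()
        },
        cancel_rx,
    )
    .await;

    assert!(result.is_none());
    println!("pending response dropped after client reset");
}
```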
This includes the changes that should detect when a client sends a `RST_STREAM` and cancel our pending response future or streaming body. Closes #986
Using the lifecycle test environment (introduced in https://github.com/runconduit/conduit-examples/pull/41) we're observing a gradual throughput drop from 3000qps with 100% SR to 0qps.
It's unclear whether this issue lies in the proxy, test environment, or some interaction with Kubernetes.
Setup
Test environment topology:
The `bb terminus` server exits after one minute's worth of requests, causing the gRPC connection to fail. Kubernetes then restarts the container in the existing pod.
Repro
These results were produced on Docker for Mac with Kubernetes.
Build Conduit off of master, currently at https://github.com/runconduit/conduit/tree/c5f0adafc8a831c0cc19b9abdf2fa1016e1942c5 (with #985 applied for docker build fix).
Deploy conduit and test environment
Logs
slow-cooker
proxy log in `bb-p2p` pod
tail of `bb-p2p` proxy with `--proxy-log-level debug,conduit_proxy=debug`: https://gist.github.com/siggy/874ea90f5f4c35ec7260b835bef82625
proxy log in `bb terminus` pod
tail of `bb terminus` proxy with `--proxy-log-level debug,conduit_proxy=debug`: https://gist.github.com/siggy/654a54be40175783f3d45c86933292ec
Baseline
Confirmed that without proxy injection, this issue does not arise.
`cat lifecycle.yml | kubectl apply -f -`
Note that some degradation does occur, most likely due to https://github.com/runconduit/conduit-examples/issues/42.