Linkerd fails to reset its client streams in Half Closed (remote) state when the upstream client disconnects #1696
A gRPC client initiates a bidirectional streaming GET request to Linkerd, which forwards the same request to a downstream server. The server responds with a HEADERS frame with the EOS bit set (in this case, a gRPC error), placing the stream in the "Half Closed (remote)" state while Linkerd is still sending the request body. Then the upstream client disconnects, which results in a synthetic remote reset. The reset cancels all of Linkerd's server streams to the client, but fails to cancel the client streams to the downstream server that are in the Half Closed (remote) state (client streams not in this state appear to be reset properly).
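To make the failure mode concrete, here is a minimal, self-contained sketch. The types and method names below are hypothetical (they are not Linkerd's actual dispatcher or stream classes); the point is only to show how a reset that skips Half Closed (remote) streams leaves them tracked after the connection is gone.

```scala
// Hypothetical sketch of the failure mode described above; these types are
// illustrative only and are not Linkerd's real classes.
object ResetPropagationSketch {

  sealed trait StreamState
  case object Open extends StreamState
  case object HalfClosedRemote extends StreamState // server already sent EOS
  case object Closed extends StreamState

  final case class ClientStream(id: Int, state: StreamState)

  final class ClientDispatcher {
    private var streams = Map.empty[Int, ClientStream]

    def track(s: ClientStream): Unit = streams += (s.id -> s)

    // Buggy behaviour: on a synthetic remote reset, streams that are already
    // Half Closed (remote) are skipped, so they stay tracked forever (the leak).
    def resetAllBuggy(): Unit =
      streams = streams.filter { case (_, s) => s.state == HalfClosedRemote }

    // Expected behaviour: every stream is reset and dropped, whatever its state.
    def resetAll(): Unit = streams = Map.empty

    def trackedStreams: Int = streams.size
  }

  def main(args: Array[String]): Unit = {
    val dispatcher = new ClientDispatcher
    dispatcher.track(ClientStream(1, Open))
    dispatcher.track(ClientStream(3, HalfClosedRemote)) // server already sent EOS headers

    dispatcher.resetAllBuggy()
    println(s"after buggy reset: ${dispatcher.trackedStreams} stream(s) still tracked") // 1

    dispatcher.resetAll()
    println(s"after full reset:  ${dispatcher.trackedStreams} stream(s) still tracked") // 0
  }
}
```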
This results in a memory leak, as these streams continue to be tracked after the connection is terminated. This can be observed using linkerd's
What you expected to happen:
The synthetic remote reset on the server dispatcher should be propagated to the client dispatcher. All client streams should be reset.
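A minimal sketch of the expected propagation, again with hypothetical names rather than Linkerd's actual dispatcher classes: when the server-side dispatcher observes the synthetic remote reset, it should forward it to the client-side dispatcher, which resets every tracked stream regardless of its HTTP/2 state.

```scala
// Hedged sketch of the expected behaviour; names are hypothetical.
object ExpectedPropagationSketch {

  final case class Reset(reason: String) extends Exception(reason)

  trait ClientDispatcher {
    /** Reset and stop tracking every stream, including Half Closed (remote) ones. */
    def resetAll(reset: Reset): Unit
  }

  final class ServerDispatcher(client: ClientDispatcher) {
    // Called when the upstream client connection goes away.
    def onRemoteDisconnect(): Unit = {
      val reset = Reset("upstream client disconnected")
      // Propagate to the client dispatcher instead of only resetting server streams.
      client.resetAll(reset)
    }
  }

  def main(args: Array[String]): Unit = {
    val client = new ClientDispatcher {
      def resetAll(reset: Reset): Unit =
        println(s"resetting all client streams: ${reset.reason}")
    }
    new ServerDispatcher(client).onRemoteDisconnect()
  }
}
```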
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Alright, I've figured out exactly why this is happening. Consider the following code, in
When a server stream is reset, it raises the reset to
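Since the code block referenced above isn't included here, the following is only a hedged sketch of the kind of mechanism this describes, with hypothetical names: if a reset is raised solely by failing the stream's pending remote read, then a stream that is already Half Closed (remote) has no pending read left to fail, and the reset is silently dropped instead of being propagated downstream.

```scala
// Hypothetical sketch; not Linkerd's actual stream transport code.
import scala.concurrent.Promise

final class StreamTransportSketch {
  // Pending remote read, if any. After the remote side sends EOS there is none.
  private var pendingRead: Option[Promise[Array[Byte]]] = Some(Promise[Array[Byte]]())

  def remoteEos(): Unit = {
    pendingRead.foreach(_.success(Array.emptyByteArray))
    pendingRead = None // stream is now Half Closed (remote)
  }

  // Buggy reset: only fails the pending read; nothing happens if there isn't one.
  def resetBuggy(cause: Throwable): Boolean =
    pendingRead match {
      case Some(p) => p.failure(cause); pendingRead = None; true
      case None    => false // reset dropped: Half Closed (remote) stream leaks
    }
}

object StreamTransportSketchDemo {
  def main(args: Array[String]): Unit = {
    val t = new StreamTransportSketch
    t.remoteEos() // server already answered with an EOS headers frame
    val propagated = t.resetBuggy(new Exception("synthetic remote reset"))
    println(s"reset propagated downstream: $propagated") // false
  }
}
```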
I added some debug logging to verify this theory, and it looks like this is indeed the case:
Hi @draveness, I'm sorry that I missed your original comment back in September. This issue had been de-prioritized because there had not been any reports of this actually happening in production. But if you're running into it, then I'd definitely like to get it fixed.
What are you able to share about the production environment where this is occurring? Do you have steps that can reliably reproduce the issue?
@adleong Sorry, I can't provide steps to reproduce the issue on 1.4.6, since that was several months ago and we moved to Istio after running into this problem. It may also have been fixed in the 1.5.1 release. Thanks for your reply anyway.