Linkerd fails to reset its client streams in Half Closed (remote) state when the upstream client disconnects #1696

Closed
hawkw opened this Issue Nov 9, 2017 · 7 comments

hawkw commented Nov 9, 2017

Issue Type:

  • Bug report

What happened:

A gRPC client initiates a bidirectional streaming GET request to Linkerd, which initiates the same request to a server. The server responds with a headers frame with the EOS bit set (in this case, a gRPC error), placing the stream in the "Half Closed (remote)" state, while Linkerd is still sending the request body. Then, the upstream client disconnects. This results in a synthetic remote reset. The reset cancels all of Linkerd's server streams to the client, but fails to cancel the client streams to the downstream server which are in the Half Closed (remote) state (client streams not in this state appear to be properly reset).

This results in a memory leak, as these streams continue to be tracked after the connection is terminated. The leak can be observed via linkerd's rt/client/.../open_streams metric, which remains unchanged (rather than dropping to zero) after the connection is closed.

What you expected to happen:

The synthetic remote reset on the server dispatcher should be propagated to the client dispatcher. All client streams should be reset.
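The expected behavior can be sketched with a toy model (ClientStream, StreamState, and resetAll are hypothetical names for illustration, not linkerd's actual classes): on a synthetic remote reset, the dispatcher should walk every tracked client stream and reset it, with no filtering on HTTP/2 stream state.

```scala
// Toy model of the expected reset propagation. All names here are
// hypothetical stand-ins, not linkerd's actual API.
sealed trait StreamState
case object Open extends StreamState
case object HalfClosedRemote extends StreamState // EOS already seen from server

final class ClientStream(val id: Int, val state: StreamState) {
  var wasReset = false
  def reset(): Unit = wasReset = true // would emit RST_STREAM downstream
}

object ResetPropagation {
  // On a synthetic remote reset, cancel *all* client streams -- including
  // those in Half Closed (remote), which the buggy code path skips.
  def resetAll(streams: Iterable[ClientStream]): Unit =
    streams.foreach(_.reset())
}
```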

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • linkerd/namerd version, config files: linkerd 1.3.2
  • Platform, version, and config files (Kubernetes, DC/OS, etc): running locally with strest-grpc
  • Cloud provider or hardware configuration: running locally (MacBook)

@hawkw hawkw added bug gRPC h2 labels Nov 9, 2017

@hawkw hawkw self-assigned this Nov 10, 2017

@hawkw hawkw added this to Bug in Linkerd Roadmap Nov 14, 2017

hawkw commented Nov 14, 2017

Alright, I've figured out exactly why this is happening. Consider the following code, in Netty4ServerDispatcher.serveStream():

val serveF = st.onRecvMessage.flatMap(serve).flatMap(st.send(_).flatten)

// When the stream is reset, ensure that the cancelation is
// propagated downstream.
st.onReset.onFailure {
  case StreamError.Remote(rst: Reset) => serveF.raise(rst)
  case StreamError.Remote(e)          => serveF.raise(Reset.Cancel)
  case e                              => serveF.raise(e)
}

When a server stream is reset, it raises the reset to serveF, a future chain which should include the Service passed to the Netty4ServerDispatcher constructor; namely, the corresponding Netty4ClientDispatcher (and its underlying stream transport). If that future is still pending, raising the reset exception will reset that stream. However, if the future has already completed, the client stream will not be reset. It appears that streams entering the Half Closed (remote) state satisfy the future, and are thus never reset.
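The failure mode can be illustrated with a minimal stand-in for the interrupt semantics of com.twitter.util.Future (CancellableTask is a hypothetical toy class, not linkerd code): a raise only has an effect while the future is pending, so a stream whose serveF was already satisfied by an EOS headers frame silently drops the reset.

```scala
// Toy stand-in for Future interrupt semantics: an interrupt raised after
// the future is satisfied is a no-op. Hypothetical class, not linkerd's.
final class CancellableTask {
  @volatile private var done = false
  @volatile var interrupt: Option[Throwable] = None

  def complete(): Unit = done = true // e.g. EOS headers frame satisfies serveF

  def raise(e: Throwable): Unit =
    if (!done) interrupt = Some(e) // pending: the reset reaches the client stream
    // already satisfied: the raise is dropped and the stream leaks
}

object RaiseDemo extends App {
  val pending = new CancellableTask
  pending.raise(new RuntimeException("Reset.Cancel"))
  println(pending.interrupt.isDefined) // true: pending stream is reset

  val halfClosed = new CancellableTask
  halfClosed.complete() // EOS from server satisfied serveF
  halfClosed.raise(new RuntimeException("Reset.Cancel"))
  println(halfClosed.interrupt.isDefined) // false: reset never propagates
}
```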

I added some debug logging to verify this theory, and it looks like this is indeed the case:

D 1114 23:15:53.966 UTC THREAD27: [S L:/127.0.0.1:4140 R:/127.0.0.1:61757 S:1] st.onReset -> serveF.raise(Reset.Cancel); serveF.isDone=true
D 1114 23:15:53.966 UTC THREAD27: [S L:/127.0.0.1:4140 R:/127.0.0.1:61757 S:1] stream reset from remote: Reset.Cancel
D 1114 23:15:53.966 UTC THREAD27: [S L:/127.0.0.1:4140 R:/127.0.0.1:61757 S:17] st.onReset -> serveF.raise(Reset.Cancel); serveF.isDone=true
D 1114 23:15:53.966 UTC THREAD27: [S L:/127.0.0.1:4140 R:/127.0.0.1:61757 S:17] stream reset from remote: Reset.Cancel
D 1114 23:15:53.966 UTC THREAD27: [S L:/127.0.0.1:4140 R:/127.0.0.1:61757 S:3] st.onReset -> serveF.raise(Reset.Cancel); serveF.isDone=true
D 1114 23:15:53.966 UTC THREAD27: [S L:/127.0.0.1:4140 R:/127.0.0.1:61757 S:3] stream reset from remote: Reset.Cancel
D 1114 23:15:53.966 UTC THREAD27: [S L:/127.0.0.1:4140 R:/127.0.0.1:61757 S:7] st.onReset -> serveF.raise(Reset.Cancel); serveF.isDone=true
D 1114 23:15:53.966 UTC THREAD27: [S L:/127.0.0.1:4140 R:/127.0.0.1:61757 S:7] stream reset from remote: Reset.Cancel
D 1114 23:15:53.966 UTC THREAD27: [S L:/127.0.0.1:4140 R:/127.0.0.1:61757 S:9] st.onReset -> serveF.raise(Reset.Cancel); serveF.isDone=true
D 1114 23:15:53.966 UTC THREAD27: [S L:/127.0.0.1:4140 R:/127.0.0.1:61757 S:9] stream reset from remote: Reset.Cancel
D 1114 23:15:53.966 UTC THREAD27: [S L:/127.0.0.1:4140 R:/127.0.0.1:61757 S:11] st.onReset -> serveF.raise(Reset.Cancel); serveF.isDone=true
D 1114 23:15:53.966 UTC THREAD27: [S L:/127.0.0.1:4140 R:/127.0.0.1:61757 S:11] stream reset from remote: Reset.Cancel
D 1114 23:15:53.966 UTC THREAD27: [S L:/127.0.0.1:4140 R:/127.0.0.1:61757 S:13] st.onReset -> serveF.raise(Reset.Cancel); serveF.isDone=true

hawkw commented Nov 16, 2017

D 1115 23:58:32.350 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533] go away: GoAway.InternalError
D 1115 23:58:32.350 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533] resetting all streams: Reset.Cancel
D 1115 23:58:32.353 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533 S:1] resetting Reset.Cancel in LocalClosed(remote: RemoteStreaming)
D 1115 23:58:32.353 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533] reset Netty4StreamTransport.Server(state: Closed(error: Reset.Cancel)): Reset.Cancel
D 1115 23:58:32.353 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533 S:17] resetting Reset.Cancel in LocalClosed(remote: RemoteStreaming)
D 1115 23:58:32.353 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533] reset Netty4StreamTransport.Server(state: Closed(error: Reset.Cancel)): Reset.Cancel
D 1115 23:58:32.354 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533 S:19] resetting Reset.Cancel in RemoteClosed(15 offers)
D 1115 23:58:32.354 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533 S:19] RemoteClosed(15 offers).reset(Reset.Cancel); q failed
D 1115 23:58:32.354 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533] reset Netty4StreamTransport.Server(state: Closed(error: Reset.Cancel)): Reset.Cancel
D 1115 23:58:32.354 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533 S:21] resetting Reset.Cancel in RemoteClosed(0 offers)
D 1115 23:58:32.354 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533 S:21] RemoteClosed(0 offers).reset(Reset.Cancel); q failed
D 1115 23:58:32.354 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533] reset Netty4StreamTransport.Server(state: Closed(error: Reset.Cancel)): Reset.Cancel
D 1115 23:58:32.354 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533 S:13] resetting Reset.Cancel in LocalClosed(remote: RemoteStreaming)
D 1115 23:58:32.354 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533] reset Netty4StreamTransport.Server(state: Closed(error: Reset.Cancel)): Reset.Cancel
D 1115 23:58:32.354 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533] streams={1=StreamRemoteReset$, 17=StreamRemoteReset$, 19=StreamRemoteReset$, 21=StreamRemoteReset$, 13=StreamRemoteReset$}
D 1115 23:58:32.355 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533 S:1] reset by Reset.Cancel, raising to serveF
D 1115 23:58:32.355 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533 S:1] stream reset from remote: Reset.Cancel
D 1115 23:58:32.355 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533 S:17] reset by Reset.Cancel, raising to serveF
D 1115 23:58:32.355 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533 S:17] stream reset from remote: Reset.Cancel
D 1115 23:58:32.355 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533 S:19] reset by Reset.Cancel, raising to serveF
D 1115 23:58:32.355 UTC THREAD35: [C L:/127.0.0.1:65534 R:localhost/127.0.0.1:8888 S:21] remote message interrupted: Reset.Cancel
D 1115 23:58:32.355 UTC THREAD35: [C L:/127.0.0.1:65534 R:localhost/127.0.0.1:8888 S:21] resetting Reset.Cancel in Open(remote: RemotePending)
D 1115 23:58:32.355 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533 S:19] stream reset from remote: Reset.Cancel
D 1115 23:58:32.355 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533 S:21] reset by Reset.Cancel, raising to serveF
D 1115 23:58:32.355 UTC THREAD35: [C L:/127.0.0.1:65534 R:localhost/127.0.0.1:8888 S:23] remote message interrupted: Reset.Cancel
D 1115 23:58:32.355 UTC THREAD35: [C L:/127.0.0.1:65534 R:localhost/127.0.0.1:8888 S:23] resetting Reset.Cancel in LocalClosed(remote: RemotePending)
D 1115 23:58:32.356 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533 S:21] stream reset from remote: Reset.Cancel
D 1115 23:58:32.356 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533 S:13] reset by Reset.Cancel, raising to serveF
D 1115 23:58:32.356 UTC THREAD36 TraceId:a6e34d95c3925567: [C L:/127.0.0.1:65534 R:localhost/127.0.0.1:8888 S:21] remote write failed: Reset.Cancel
D 1115 23:58:32.356 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533 S:13] stream reset from remote: Reset.Cancel
I 1115 23:58:32.368 UTC THREAD35 TraceId:a6e34d95c3925567: [C L:/127.0.0.1:65534 R:localhost/127.0.0.1:8888 S:21] onReset.onFailure: StreamError.Local(Reset.Cancel)
D 1115 23:58:32.368 UTC THREAD35 TraceId:a6e34d95c3925567: [C L:/127.0.0.1:65534 R:localhost/127.0.0.1:8888 S:21] stream reset from local: Reset.Cancel
D 1115 23:58:32.368 UTC THREAD35 TraceId:a6e34d95c3925567: [C L:/127.0.0.1:65534 R:localhost/127.0.0.1:8888 S:21] stream reset from local; resetting remote: Reset.Cancel
I 1115 23:58:32.370 UTC THREAD35 TraceId:fd56186acc091384: [C L:/127.0.0.1:65534 R:localhost/127.0.0.1:8888 S:23] onReset.onFailure: StreamError.Local(Reset.Cancel)
D 1115 23:58:32.370 UTC THREAD35 TraceId:fd56186acc091384: [C L:/127.0.0.1:65534 R:localhost/127.0.0.1:8888 S:23] stream reset from local: Reset.Cancel
D 1115 23:58:32.370 UTC THREAD35 TraceId:fd56186acc091384: [C L:/127.0.0.1:65534 R:localhost/127.0.0.1:8888 S:23] stream reset from local; resetting remote: Reset.Cancel
D 1115 23:58:32.375 UTC THREAD35 TraceId:8950e59880c20995: [C L:/127.0.0.1:65534 R:localhost/127.0.0.1:8888 S:19] stream read failed: Reset.Cancel
D 1115 23:58:32.375 UTC THREAD35 TraceId:8950e59880c20995: [C L:/127.0.0.1:65534 R:localhost/127.0.0.1:8888 S:19] resetting Reset.Cancel in RemoteClosed(0 offers)
D 1115 23:58:32.375 UTC THREAD35 TraceId:8950e59880c20995: [C L:/127.0.0.1:65534 R:localhost/127.0.0.1:8888 S:19] RemoteClosed(0 offers).reset(Reset.Cancel); q failed
I 1115 23:58:32.377 UTC THREAD35 TraceId:8950e59880c20995: [C L:/127.0.0.1:65534 R:localhost/127.0.0.1:8888 S:19] onReset.onFailure: StreamError.Local(Reset.Cancel)
D 1115 23:58:32.377 UTC THREAD35 TraceId:8950e59880c20995: [C L:/127.0.0.1:65534 R:localhost/127.0.0.1:8888 S:19] stream reset from local: Reset.Cancel
D 1115 23:58:32.377 UTC THREAD35 TraceId:8950e59880c20995: [C L:/127.0.0.1:65534 R:localhost/127.0.0.1:8888 S:19] stream reset from local; resetting remote: Reset.Cancel
W 1115 23:58:32.379 UTC THREAD35 TraceId:a6e34d95c3925567: Exception propagated to the default monitor (upstream address: /127.0.0.1:65533, downstream address: localhost/127.0.0.1:8888, label: $/inet/localhost/8888).
Reset.Cancel

W 1115 23:58:32.379 UTC THREAD35 TraceId:fd56186acc091384: Exception propagated to the default monitor (upstream address: /127.0.0.1:65533, downstream address: localhost/127.0.0.1:8888, label: $/inet/localhost/8888).
Reset.Cancel

D 1115 23:58:32.386 UTC THREAD35: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533] transport closed: com.twitter.finagle.ChannelClosedException: ChannelException at remote address: /127.0.0.1:65533. Remote Info: Not Available
D 1115 23:58:59.292 UTC THREAD11: [S L:/127.0.0.1:4140 R:/127.0.0.1:65533] streams={1=StreamRemoteReset$, 17=StreamRemoteReset$, 19=StreamRemoteReset$, 21=StreamRemoteReset$, 13=StreamRemoteReset$}
D 1115 23:58:59.611 UTC THREAD11: [C L:/127.0.0.1:65534 R:localhost/127.0.0.1:8888] streams={17=StreamOpen(Netty4StreamTransport.Client(state: RemoteClosed(0 offers))), 3=StreamOpen(Netty4StreamTransport.Client(state: RemoteClosed(0 offers))), 19=StreamLocalReset$, 5=StreamOpen(Netty4StreamTransport.Client(state: RemoteClosed(0 offers))), 21=StreamLocalReset$, 7=StreamOpen(Netty4StreamTransport.Client(state: RemoteClosed(0 offers))), 23=StreamLocalReset$, 9=StreamOpen(Netty4StreamTransport.Client(state: RemoteClosed(0 offers))), 11=StreamOpen(Netty4StreamTransport.Client(state: RemoteClosed(0 offers))), 13=StreamOpen(Netty4StreamTransport.Client(state: RemoteClosed(0 offers))), 15=StreamOpen(Netty4StreamTransport.Client(state: RemoteClosed(0 offers)))}
I 1115 23:59:10.969 UTC THREAD44: Received SIGINT. Shutting down ...
I 1115 23:59:10.995 UTC THREAD44: Reaping /svc/localhost:4140
I 1115 23:59:11.028 UTC THREAD44: Reaping $/inet/localhost/8888

draveness commented Sep 21, 2018

Hi, we've run into the same problem in production. What's the status of this issue?

draveness commented Jan 17, 2019

We removed Linkerd from production because of this problem. 😓

adleong commented Jan 17, 2019

Hi @draveness, I'm sorry that I missed your original comment back in September. This issue had been de-prioritized because there had not been any reports of this actually happening in production. But if you're running into it, then I'd definitely like to get it fixed.

What are you able to share about the production environment where this is occurring? Do you have steps that can reliably reproduce the issue?

@adleong adleong moved this from Bug to Needs more information in Linkerd Roadmap Jan 17, 2019

draveness commented Jan 18, 2019

@adleong Sorry, I can't provide a way to reproduce the issue in 1.4.6, since it was several months ago and we moved to Istio after running into this problem. It may have been fixed in the 1.5.1 release, though. Thanks for your reply anyway.

HTTP/2

  • Fixes an HTTP/2 issue that causes Linkerd to stop processing incoming frames on an HTTP/2 connection after Linkerd sends a RST_STREAM frame to its remote peer. This was causing gRPC clients to experience timeout errors intermittently because connections between Linkerd and its remote peers weren't being closed properly.
  • Sets the maxConcurrentStreamsPerConnection config value for the h2 router to 1000 by default to prevent Linkerd from running out of memory when HTTP/2 clients leak connection streams.
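The second item can be expressed in a linkerd 1.x config along these lines. The parameter name comes from the release notes, but the exact nesting and port here are assumptions; check the linkerd h2 router configuration reference for the precise location.

```yaml
# Hedged sketch: h2 router with an explicit per-connection stream cap.
# Nesting and port number are illustrative assumptions.
routers:
- protocol: h2
  servers:
  - port: 4142
    maxConcurrentStreamsPerConnection: 1000
```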

adleong commented Feb 4, 2019

Thanks for the update, @draveness. I will close this issue until we get a reproducible report of this. Please re-open if you run into this again!

@adleong adleong closed this Feb 4, 2019

@adleong adleong removed this from Needs more information in Linkerd Roadmap Feb 4, 2019
