Connections cannot be re-established after network loss/recovery #3266
Comments
ericgribkoff
Jul 24, 2017
Contributor
I'm not an expert on these issues, but gRPC Java's own KeepAliveManager works by shutting down the connection if the server does not respond to the keep-alive ping within the configured time limit. I believe you would want to do something similar in your case when you get deadline exceeded; otherwise, your outgoing RPCs will continue to use the broken connection. Until you see the broken pipe exception, the socket used by Netty/gRPC remains in a usable state as far as your side of the connection is aware. You'll eventually see the broken pipe exception, but this can take a while, so your best bet is to tear down the connection as soon as you see the deadline exceeded error (or consider using gRPC Java's built-in keep-alives, if they would satisfy your use case) and attempt to reconnect. Reconnecting will fail if the network is still down; you can either set wait-for-ready on the stub or call options, or implement your own backoff and retry logic around the connection attempts.
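A minimal sketch of the "backoff and retry logic" mentioned above, in plain Java. The class name, defaults, and jitter factor here are illustrative and not from gRPC itself (gRPC has its own internal reconnect backoff); this only stands in for a hand-rolled retry loop around channel creation:

```java
import java.util.Random;

// Exponential backoff with jitter for reconnect attempts: each failed
// attempt sleeps for nextDelayMillis(), and reset() is called after a
// successful reconnect.
public class ReconnectBackoff {
    private final long initialMillis;
    private final long maxMillis;
    private final double multiplier;
    private long nextMillis;
    private final Random random = new Random();

    public ReconnectBackoff(long initialMillis, long maxMillis, double multiplier) {
        this.initialMillis = initialMillis;
        this.maxMillis = maxMillis;
        this.multiplier = multiplier;
        this.nextMillis = initialMillis;
    }

    /** Returns the delay before the next attempt; the base grows geometrically up to maxMillis. */
    public long nextDelayMillis() {
        long delay = nextMillis;
        nextMillis = Math.min((long) (nextMillis * multiplier), maxMillis);
        // Add up to 20% jitter so many clients don't reconnect in lockstep.
        return delay + (long) (random.nextDouble() * 0.2 * delay);
    }

    /** Call after a successful reconnect to start the schedule over. */
    public void reset() {
        nextMillis = initialMillis;
    }
}
```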
stephenh
Jul 25, 2017
Contributor
Thanks @ericgribkoff! You're right, making a new channel each time works.
I learned a few things from poking around there:
- I had assumed the keep-alive support wouldn't work for streams, e.g. the channel might be able to make new ping request/responses while my previously-established stream was somehow dead. But since they both use the same underlying transport, and the pings happen on the same transport as the stream, I can stop doing my own heartbeat-over-the-stream hacks.
- If #2292 were implemented, I'd have exactly what I need: I just want to pause WIP processing when the transport fails, and then be told when I can restart it once the transport is re-established.
So, I might have to keep my hacky manual reconnect code until #2292 is fixed, but it will be great to get rid of it when I can.
Thanks again!
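Since the keep-alive pings ride the same HTTP/2 transport as any open streams, enabling gRPC's built-in keep-alives on the channel covers long-lived streams too. A configuration sketch along those lines, assuming the Netty transport; the host, port, and timing values are illustrative, not recommendations:

```java
import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;

// Channel setup sketch: keep-alive pings share the transport with any
// open streams, so a dead transport is detected even while a stream is
// idle, without an application-level heartbeat.
public class KeepAliveChannel {
    public static ManagedChannel build(String host, int port) {
        return NettyChannelBuilder.forAddress(host, port)
            // Send a keep-alive ping after this much transport idle time.
            .keepAliveTime(30, TimeUnit.SECONDS)
            // Close the transport if no ping ack arrives within this window.
            .keepAliveTimeout(10, TimeUnit.SECONDS)
            .usePlaintext(true) // plaintext only for a local test setup
            .build();
    }
}
```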
stephenh closed this Jul 25, 2017
stephenh
Jul 25, 2017
Contributor
FWIW I started using KeepAliveManager, but I'm seeing pings sent too often, so I consistently get "enhance your calm" errors. Filed more details as #3274.
stephenh commented Jul 22, 2017
Please answer these questions before submitting your issue.
What version of gRPC are you using?
1.5.0
What JVM are you using (java -version)?
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
What did you do?
Ran a grpc-java client program, using Netty, that makes application-level pings to a grpc-java server. The client runs in a loop: ping, sleep, ping, sleep. If I disconnect the network, I get deadline exceeded (good), but after I reconnect the network I continue to get deadline-exceeded messages.
If possible, provide a recipe for reproducing the error.
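A rough sketch of the loop described above, with the actual gRPC call abstracted behind a Callable so the shape stands on its own; the names here are illustrative, not from the original program (the real version would pass a lambda invoking a gRPC stub with a deadline):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Repro-loop sketch: issue a ping-style RPC, report the outcome, sleep,
// repeat. Returns the number of failed iterations; with gRPC, a
// StatusRuntimeException with DEADLINE_EXCEEDED would surface in the
// catch block when the network is down.
public class PingLoop {
    public static int run(Callable<String> pingRpc, int iterations, long sleepMillis)
            throws InterruptedException {
        int failures = 0;
        for (int i = 0; i < iterations; i++) {
            try {
                System.out.println("ping ok: " + pingRpc.call());
            } catch (Exception e) {
                failures++;
                System.out.println("ping failed: " + e.getMessage());
            }
            TimeUnit.MILLISECONDS.sleep(sleepMillis);
        }
        return failures;
    }
}
```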
What did you expect to see?
For new connections to work successfully after the network was restored.
FWIW, while debugging the issue, I paused the ClientCalls thread and poked around for a while, ~5-10 minutes. I didn't really find anything, but when I hit "resume", I saw a broken pipe exception (which I don't usually see; usually it's just the deadline exceeded), and then the connection started working. I don't want to lead you astray, but it seems like the connection was not fully restarted until that pipe broke.
Understood this may not be a grpc-java issue, but an underlying Netty or even inherent TCP issue that I just don't understand.