Connections cannot be re-established after network loss/recovery #3266

Closed

stephenh opened this issue Jul 22, 2017 · 3 comments

Comments

@stephenh
Contributor

Please answer these questions before submitting your issue.

What version of gRPC are you using?

1.5.0

What JVM are you using (java -version)?

java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

What did you do?

Ran a grpc-java client (using Netty) that sends application-level pings to a grpc-java server in a loop: ping, sleep, ping, sleep. If I disconnect the network, I get deadline exceeded (good), but after I reconnect the network, I continue to get deadline exceeded errors.
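
For context, a minimal sketch of the loop in question; the PingService stub, host, port, and timing values are illustrative placeholders, not the actual mirror code:

```java
import io.grpc.ManagedChannel;
import io.grpc.StatusRuntimeException;
import io.grpc.netty.NettyChannelBuilder;
import java.util.concurrent.TimeUnit;

public class PingLoop {
  public static void main(String[] args) throws InterruptedException {
    ManagedChannel channel = NettyChannelBuilder
        .forAddress("server-host", 10000)
        .usePlaintext(true)
        .build();
    // PingServiceGrpc/PingRequest are a hypothetical generated stub for the
    // application-level ping RPC.
    PingServiceGrpc.PingServiceBlockingStub stub = PingServiceGrpc.newBlockingStub(channel);

    while (true) {
      try {
        // Each ping carries its own deadline, so a dead connection surfaces as DEADLINE_EXCEEDED.
        stub.withDeadlineAfter(2, TimeUnit.SECONDS).ping(PingRequest.getDefaultInstance());
        System.out.println("ping ok");
      } catch (StatusRuntimeException e) {
        // After a network drop this keeps firing, even once the network is back up.
        System.out.println("ping failed: " + e.getStatus());
      }
      Thread.sleep(1000);
    }
  }
}
```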

If possible, provide a recipe for reproducing the error.

  • Build http://github.com/stephenh/mirror
  • On one machine, run ./mirror server
  • On another machine, run ConnectionDetector.Impl.main (e.g. in an IDE)
  • Disrupt the network (for me, logging off the VPN) and watch deadline exceeded errors start happening
  • Reconnect the network (reconnect to the VPN); deadline exceeded keeps happening

What did you expect to see?

New connections to work successfully after the network was restored.

FWIW, while debugging the issue, I paused the ClientCalls thread and poked around for a while (~5-10 minutes). I didn't find anything conclusive, but when I hit "resume", I saw a broken pipe exception (which I don't usually see; usually it's just the deadline exceeded), and then the connection started working. I don't want to lead you astray, but it seems like the connection was not fully getting restarted until that pipe broke.

I understand this may not be a grpc-java issue, but rather some underlying Netty or even inherent TCP behavior that I just don't understand.

@ericgribkoff
Contributor

I'm not an expert on these issues, but gRPC Java's own KeepAliveManager works by shutting down the connection if the server does not respond to a keep-alive ping within the configured time limit. I believe you would want to do something similar in your case when you get deadline exceeded; otherwise, your outgoing RPCs will continue to use the broken connection. Until you see the broken pipe exception, the socket used by Netty/gRPC remains in a usable state as far as your side of the connection is aware.

You'll eventually see the broken pipe exception, but this can take a while, so your best bet is to tear down the connection as soon as you see the deadline exceeded error (or consider using gRPC Java's built-in keep-alives, if they would satisfy your use case) and attempt to reconnect. Reconnecting will fail if the network is still down, and you can either set wait-for-ready on the stub or call options (e.g., withWaitForReady()) or implement your own backoff and retry logic around the connection attempts.
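
A minimal sketch of those two suggestions, reusing the hypothetical PingService stub from above; host, port, and timing values are illustrative:

```java
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import java.util.concurrent.TimeUnit;

public class ChannelFactory {
  public static PingServiceGrpc.PingServiceBlockingStub newStub() {
    // Built-in HTTP/2 keep-alives: if the server does not answer a keep-alive
    // ping within keepAliveTimeout, the client tears the transport down and
    // reconnects instead of leaving RPCs on the broken connection.
    ManagedChannel channel = NettyChannelBuilder
        .forAddress("server-host", 10000)
        .usePlaintext(true)
        .keepAliveTime(30, TimeUnit.SECONDS)
        .keepAliveTimeout(10, TimeUnit.SECONDS)
        .build();

    // Wait-for-ready: RPCs queue while the channel reconnects instead of
    // failing immediately while the network is still down.
    return PingServiceGrpc.newBlockingStub(channel).withWaitForReady();
  }
}
```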

@stephenh
Contributor Author

Thanks @ericgribkoff! You're right, making a new channel each time works.

I learned a few things from poking around there:

  1. I had assumed the keep-alive support wouldn't work for streams, e.g. the channel might be able to make new ping requests/responses while my previously-established stream was somehow dead. But since they both use the same underlying transport, and the pings happen on the same transport as the stream, I can stop doing my own heartbeat-over-the-stream hacks.

  2. If #2292 (Implement channel-state API in ManagedChannelImpl) were implemented, I'd have exactly what I need: I just want to be able to pause WIP processing when the transport fails, and then be told when I can restart it once the transport can be re-established.

So, I might have to keep my hacky manual reconnect code until #2292 is fixed, but it will be great to get rid of it when I can.
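
For reference, that manual reconnect workaround looks roughly like the sketch below; the PingService stub and the newChannel helper are illustrative, not the actual mirror code:

```java
import io.grpc.ManagedChannel;
import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import io.grpc.netty.NettyChannelBuilder;
import java.util.concurrent.TimeUnit;

public class ReconnectingPinger {
  private ManagedChannel channel = newChannel();

  public void pingOnce() {
    try {
      PingServiceGrpc.newBlockingStub(channel)
          .withDeadlineAfter(2, TimeUnit.SECONDS)
          .ping(PingRequest.getDefaultInstance());
    } catch (StatusRuntimeException e) {
      if (e.getStatus().getCode() == Status.Code.DEADLINE_EXCEEDED) {
        // The old transport may be broken; throw the channel away and build a
        // fresh one so the next ping triggers a new connection attempt.
        channel.shutdownNow();
        channel = newChannel();
      }
    }
  }

  private static ManagedChannel newChannel() {
    return NettyChannelBuilder.forAddress("server-host", 10000)
        .usePlaintext(true)
        .build();
  }
}
```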

Thanks again!

@stephenh
Contributor Author

FWIW I started using KeepAliveManager, but am seeing pings sent too often, and so consistently get "enhance your calm" errors. Filed more details as #3274.

lock bot locked and limited the conversation to collaborators on Sep 21, 2018