Connections cannot be re-established after network loss/recovery #3266

Closed

stephenh opened this issue Jul 22, 2017 · 3 comments

Comments

@stephenh
Contributor

Please answer these questions before submitting your issue.

What version of gRPC are you using?

1.5.0

What JVM are you using (java -version)?

java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

What did you do?

Ran a grpc-java client (using Netty) that sends application-level pings to a grpc-java server in a loop: ping, sleep, ping, sleep. If I disconnect the network, I get deadline exceeded (good), but after I reconnect the network, I continue to get deadline exceeded errors.
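
For context, a minimal sketch of the loop in question; the PingService stub, host, port, and timing values are illustrative placeholders, not the actual mirror code:

```java
import io.grpc.ManagedChannel;
import io.grpc.StatusRuntimeException;
import io.grpc.netty.NettyChannelBuilder;
import java.util.concurrent.TimeUnit;

public class PingLoop {
  public static void main(String[] args) throws InterruptedException {
    ManagedChannel channel = NettyChannelBuilder
        .forAddress("server-host", 10000)
        .usePlaintext(true)
        .build();
    // PingServiceGrpc/PingRequest are a hypothetical generated stub for the
    // application-level ping RPC.
    PingServiceGrpc.PingServiceBlockingStub stub = PingServiceGrpc.newBlockingStub(channel);

    while (true) {
      try {
        // Each ping carries its own deadline, so a dead connection surfaces as DEADLINE_EXCEEDED.
        stub.withDeadlineAfter(2, TimeUnit.SECONDS).ping(PingRequest.getDefaultInstance());
        System.out.println("ping ok");
      } catch (StatusRuntimeException e) {
        // After a network drop this keeps firing, even once the network is back up.
        System.out.println("ping failed: " + e.getStatus());
      }
      Thread.sleep(1000);
    }
  }
}
```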

If possible, provide a recipe for reproducing the error.

  • Build http://github.com/stephenh/mirror
  • On one machine, run ./mirror server
  • On another machine, run ConnectionDetector.Impl.main (e.g. in an IDE)
  • Disrupt the network (for me, logging off the VPN) and watch deadline exceeded errors start happening
  • Reconnect the network (reconnect to the VPN); deadline exceeded keeps happening

What did you expect to see?

New connections to work successfully after the network was restored.

FWIW, while debugging the issue, I paused the ClientCalls thread and poked around for a while (~5-10 minutes). I didn't find anything conclusive, but when I hit "resume", I saw a broken pipe exception (which I don't usually see; usually it's just the deadline exceeded), and then the connection started working. I don't want to lead you astray, but it seems like the connection was not fully getting restarted until that pipe broke.

I understand this may not be a grpc-java issue, but rather some underlying Netty or even inherent TCP behavior that I just don't understand.

@ericgribkoff
Contributor

I'm not an expert on these issues, but gRPC Java's own KeepAliveManager works by shutting down the connection if the server does not respond to a keep-alive ping within the configured time limit. I believe you would want to do something similar in your case when you get deadline exceeded; otherwise, your outgoing RPCs will continue to use the broken connection. Until you see the broken pipe exception, the socket used by Netty/gRPC remains in a usable state as far as your side of the connection is aware.

You'll eventually see the broken pipe exception, but this can take a while, so your best bet is to tear down the connection as soon as you see the deadline exceeded error (or consider using gRPC Java's built-in keep-alives, if they would satisfy your use case) and attempt to reconnect. Reconnecting will fail if the network is still down, and you can either set wait-for-ready on the stub or call options (e.g., withWaitForReady()) or implement your own backoff and retry logic around the connection attempts.
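
A minimal sketch of those two suggestions, reusing the hypothetical PingService stub from above; host, port, and timing values are illustrative:

```java
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import java.util.concurrent.TimeUnit;

public class ChannelFactory {
  public static PingServiceGrpc.PingServiceBlockingStub newStub() {
    // Built-in HTTP/2 keep-alives: if the server does not answer a keep-alive
    // ping within keepAliveTimeout, the client tears the transport down and
    // reconnects instead of leaving RPCs on the broken connection.
    ManagedChannel channel = NettyChannelBuilder
        .forAddress("server-host", 10000)
        .usePlaintext(true)
        .keepAliveTime(30, TimeUnit.SECONDS)
        .keepAliveTimeout(10, TimeUnit.SECONDS)
        .build();

    // Wait-for-ready: RPCs queue while the channel reconnects instead of
    // failing immediately while the network is still down.
    return PingServiceGrpc.newBlockingStub(channel).withWaitForReady();
  }
}
```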

@stephenh
Contributor Author

Thanks @ericgribkoff! You're right, making a new channel each time works.

I learned a few things from poking around there:

  1. I had assumed the keep-alive support wouldn't work for streams, e.g. the channel might be able to make new ping requests/responses while my previously-established stream was somehow dead. But since they both use the same underlying transport, and the pings happen on the same transport as the stream, I can stop doing my own heartbeat-over-the-stream hacks.

  2. If #2292 (Implement channel-state API in ManagedChannelImpl) were implemented, I'd have exactly what I need: I just want to be able to pause WIP processing when the transport fails, and then be told when I can restart it once the transport can be re-established.

So, I might have to keep my hacky manual reconnect code until #2292 is fixed, but it will be great to get rid of it when I can.
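
For reference, that manual reconnect workaround looks roughly like the sketch below; the PingService stub and the newChannel helper are illustrative, not the actual mirror code:

```java
import io.grpc.ManagedChannel;
import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import io.grpc.netty.NettyChannelBuilder;
import java.util.concurrent.TimeUnit;

public class ReconnectingPinger {
  private ManagedChannel channel = newChannel();

  public void pingOnce() {
    try {
      PingServiceGrpc.newBlockingStub(channel)
          .withDeadlineAfter(2, TimeUnit.SECONDS)
          .ping(PingRequest.getDefaultInstance());
    } catch (StatusRuntimeException e) {
      if (e.getStatus().getCode() == Status.Code.DEADLINE_EXCEEDED) {
        // The old transport may be broken; throw the channel away and build a
        // fresh one so the next ping triggers a new connection attempt.
        channel.shutdownNow();
        channel = newChannel();
      }
    }
  }

  private static ManagedChannel newChannel() {
    return NettyChannelBuilder.forAddress("server-host", 10000)
        .usePlaintext(true)
        .build();
  }
}
```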

Thanks again!

@stephenh
Contributor Author

FWIW I started using KeepAliveManager, but am seeing pings sent too often, and so consistently get "enhance your calm" errors. Filed more details as #3274.

lock bot locked and limited the conversation to collaborators on Sep 21, 2018