Skip to content
This repository has been archived by the owner on Nov 15, 2019. It is now read-only.

Fire LostPeer as soon as network interface becomes unacessible #578

Closed
vinipsmaker opened this issue Mar 7, 2016 · 10 comments
Closed

Fire LostPeer as soon as network interface becomes unacessible #578

vinipsmaker opened this issue Mar 7, 2016 · 10 comments

Comments

@vinipsmaker
Copy link
Contributor

A certain behavior was observed on osx:

When the wifi is turned off, you dont get a disconnect (LostPeer) raised on the osx machine nor the peer it’s connecting to even if thats a different platform.

For communication purposes (within this thread only), let client be the peer who turns off the wifi. server will be the other end.

TCP is reliable and connection-oriented and could be able to recover these connections if client re-enable its wifi soon enough. For server, the only option to get the LostPeer event will be to wait until timeout is reached. For client, the network interfaces could be watched and notification could be made immediately. In crust, we want client to be notified of connection loss immediately (this means watching network interfaces and ungracefully closing the TCP connection to avoid connection being restored by the OS).

Related issue: maidsafe-archive/get_if_addrs#17

@dirvine
Copy link
Contributor

dirvine commented Mar 7, 2016

I think unless the tcp stream actually times out we should not close the connection. Although it does cause issues like with ssh sessions so I see the issue here for sure. If we wait then it may appear very strange to a user. I think we may need an event NetworkDisconnected when we see no traffic (esp if we have hearbeats like in utp). This may reconnect or it may not though. So then the question is how to differentiate a temporary lost connection and a definitely lost connection?
I think when we see no traffic we can issue the NetworkDisconnected meaning clients need to show a waiting type image. If we then get timeouts we issue LostPeer events and the client has to restart crust?

@vinipsmaker
Copy link
Contributor Author

I think when we see no traffic we can issue the NetworkDisconnected meaning clients need to show a waiting type image.

It's not only an UI change. API should back off and not schedule/write new messages until connection is okay again.

#cc: @Viv-Rajkumar, @afck, @canndrew, @madadam (what is your opinion on this matter?)

@canndrew
Copy link
Contributor

canndrew commented Mar 9, 2016

It would be easy enough to use a shorter TCP timeout and fire a LostPeer event quickly after a connection goes down. We could even detect when a network interface goes down and fire a LostPeer for all our connected peers.

This NetworkDisconnected event sounds like something a lot more complicated. Currently, as far as crust is concerned, it's either connected to a peer or not. As soon as a connection goes down it fires a LostPeer event and disconnects. If we want to add a third state to mean "temporarily unavailable" then we'll need to think about how this will work. We will need to detect whether a connection went down because the peer disconnected or because of network loss etc. This means we'll need an extra crust messsage to say "I'm disconnecting from you". If the peer didn't disconnect then we'll need to try and re-establish the connection. We'll also need sequence numbers and the ability to re-send messages because when a connection gets re-established we won't know what the last message received by the remote peer was.

Also, is this NetworkDisconnected event meant to be per-peer or for the network as a whole? It's possible, for example, for our internet to go down but to still remain connected to other peers on the local network.

@vinipsmaker
Copy link
Contributor Author

It would be easy enough to use a shorter TCP timeout and fire a LostPeer event quickly after a connection goes down.

Not sure it's wise to do that. We'd be just making thing more difficult to users that experience temporary internet connectivity problems (downloading torrent, don't know).

We could even detect when a network interface goes down and fire a LostPeer for all our connected peers.

Not sure it's the correct behavior here either. So if one network interface goes down, all peers are disconnected? What if I disable my virtual network interface that is only up when I start my LXC/container jail? What happens to nodes connected via ethernet if I disable the wifi?

This NetworkDisconnected event sounds like something a lot more complicated. Currently, as far as crust is concerned, it's either connected to a peer or not. As soon as a connection goes down it fires a LostPeer event and disconnects.

Indeed. It is more complicated.

If we want to add a third state to mean "temporarily unavailable" then we'll need to think about how this will work. We will need to detect whether a connection went down because the peer disconnected or because of network loss etc. This means we'll need an extra crust messsage to say "I'm disconnecting from you". If the peer didn't disconnect then we'll need to try and re-establish the connection.

But this third state could just mean "hold on and do not send messages for a little bit, because it won't work now". It saves crust/operating-system to buffer several messages that will never be delivered. Also, if connection is really lost, the sent messages would be lost. With this change, the user can have the chance to know which messages would be lost if it tries to send the message before network is recovered.

Anyway, I'm waiting for routing guys (@afck, @madadam) feedback to know what they're willing to use.

We'll also need sequence numbers and the ability to re-send messages because when a connection gets re-established we won't know what the last message received by the remote peer was.

Why? This third state doesn't promise to reestablish connections. If network is recovered, but connections are not, the operating system itself will timeout every one of the remaining connections.

If we're really concerned about knowing which messages were delivered, we'd have a design closer to asio where there is a callback fired when the message is written.

@afck
Copy link
Collaborator

afck commented Mar 10, 2016

I agree that a NetworkDisconnected event would be nice to have but that it shouldn't interfere at all with e. g. TCP timeouts. Its purpose from Crust's point of view would only be to notify the upper layers of the situation so that the UI can display a warning.

I'm not sure whether Routing should act upon it at all, apart from passing the message on to Clients/Vaults. (Maybe it should try reestablishing any connections lost during that time, instead of immediately considering the peers lost.)

@canndrew
Copy link
Contributor

Not sure it's the correct behavior here either.

I wasn't suggesting that we should do those things, merely that we could. They seem to be along the lines of what's being suggested here.

I'm not exactly sure what is being suggested here though. @vinipsmaker is this how you're imagining it?: If a tcp/utp connection disconnects then we fire a LostPeer event. If we detect that a network interface has gone down then we fire a NetworkDisconnected event to say "even though our connections are still up they're not going to work right now and they might be about to go down, so don't send any more messages". Is that about right?

@vinipsmaker
Copy link
Contributor Author

I'm not exactly sure what is being suggested here though. @vinipsmaker is this how you're imagining it?: If a tcp/utp connection disconnects then we fire a LostPeer event. If we detect that a network interface has gone down then we fire a NetworkDisconnected event to say "even though our connections are still up they're not going to work right now and they might be about to go down, so don't send any more messages". Is that about right?

This is exactly what I'm proposing. It's the best thing I've come up with for now to please @Viv-Rajkumar initial request.

@canndrew
Copy link
Contributor

So my current thoughts on this are: Fiddling with the TCP keepalive settings seems like a bad idea. It'll make connections unstable for people experiencing temporary connectivity problems. Also it's impossible to do with OSX or uTP so we'd have to implement one of the other options anyway. The other options are: send regular ping/pong packets on all connections and raise a warning to the upper layers if we stop getting replies. The downside to this is creates overhead as all connections will have a constant (although small) amount of activity on them. The other option is to only raise the warning when an interface goes down. The downside to this is it only works for one side of the connection; the other peer won't know that the connection is unusable. Thoughts?

@vinipsmaker
Copy link
Contributor Author

Thoughts?

I like the ping-pong idea. Maybe we could implement an activity detection and only ping the channel if we haven't received other events. We could have a timeout of 2 seconds to trigger the ping-pong mechanism and a timeout of 10 seconds to trigger the warning to upper layers.

It's worth noticing it's not mutually exclusive with the NetworkDown idea.

@ustulation
Copy link
Contributor

This crate had a major re-write/re-design. Please raise an issue again if required as almost everything in the library and the way it handled things have changed. Closing this.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants