
ServerHandler: reconnect to the server if it is not responding to TCP pings #2622

Merged
merged 1 commit into from Nov 7, 2016

Conversation

@mkrautz (Member) commented Nov 5, 2016

This is implemented via a 'max in-flight TCP pings' counter, and
a rule for how many in-flight TCP pings there can be at one
point in time. (The default being 2, which means a client will
disconnect from a server if it takes the server more than 10
seconds to reply to a single ping message, given that Mumble
sends ping messages every 5 seconds.)

The idea is simple: ServerHandler keeps a counter of the number
of TCP pings it has sent, but has not yet received a response for.

When ServerHandler sends a ping, it increments the counter.

Once a pong or response for the ping has been received, it
decrements the counter.

But most importantly, ServerHandler now checks the number
of in-flight TCP pings before it sends a new ping message.
If the counter exceeds the maximum number of allowed in-flight
TCP pings, it disconnects from the server, allowing the client
to try to re-establish the connection.
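
A standalone paraphrase of the rule (PingTracker below is purely
illustrative, not a class from this patch; the real change lives in
ServerHandler::sendPing() and the message handler, shown in the diff
hunks further down):

// Sketch of the 'max in-flight TCP pings' rule, assuming the defaults
// described above: a 5-second ping interval and a limit of 2.
class PingTracker {
    int inFlight = 0;
    const int maxInFlight;
public:
    explicit PingTracker(int max = 2) : maxInFlight(max) {}

    // Called on every tick of the recurring ping timer. Returns false
    // when the connection should be torn down instead of pinged again.
    bool onPingTimer() {
        if (inFlight >= maxInFlight)
            return false; // too many unanswered pings: disconnect
        inFlight += 1;    // another ping goes out, response pending
        return true;
    }

    // Called whenever the server's ping response arrives.
    void onPong() {
        inFlight -= 1; // (the review below changes this to a reset)
    }
};

With the defaults, the third timer tick without any response returns
false, i.e. the disconnect happens about 10 seconds after the first
unanswered ping was sent.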

One scenario where this is useful is IPv6 privacy addresses.
Privacy addresses are temporary, often with a default expiry time
of 4 hours. After this time, the address is removed from the system,
and TCP connections using that address will be terminated.
(See for example #1377)

One would naturally expect it to be possible to observe the fact
that the TCP connection has been terminated, but that doesn't seem
to be what's happening. At times, I've observed it taking up
to 30 minutes until my client realized that the connection wasn't
there anymore. (However, the server that I was connected to
found out immediately, and disconnected me.)

Because of this behavior, this TCP ping check was implemented.

@Kissaki (Member) left a comment

Small typo. Rest is discussion/questions.

/// sent, but not yet recieved a response for
/// from the server. This value is checked when
/// the client sends its next ping message. If
/// The maximum is reached, the connection will

@Kissaki (Member) commented Nov 5, 2016

Typo: "The" should be "the".

/// number of ping messages that the client has
/// sent, but not yet recieved a response for
/// from the server. This value is checked when
/// the client sends its next ping message. If

@Kissaki (Member) commented Nov 5, 2016

Maybe we should make it clear that this is about periodic pings? Also in the commit message.

Although we probably don’t do so right now, pings could also be sent in other cases, where this count should not apply.

@mkrautz (Author, Member) commented Nov 5, 2016

Hmm. But we don't have non-periodic pings. I'm not sure what you're asking for here. s/pings/periodic pings/? Or just clarify that the feature is about the client-initiated TCP pings that happen every ~5s?

@@ -644,6 +645,7 @@ void Settings::load(QSettings* settings_ptr) {
SAVELOAD(iMaxImageWidth, "net/maximagewidth");
SAVELOAD(iMaxImageHeight, "net/maximageheight");
SAVELOAD(qsServicePrefix, "net/serviceprefix");
SAVELOAD(iMaxInFlightTCPPings, "net/maxinflighttcppings");

@Kissaki (Member) commented Nov 5, 2016

Given the error message we use (Server is not responding to TCP pings), maybe we should check that the setting we load is > 0? Or do we expect users who hack the registry to know what they are doing? Would be fine by me.

@mkrautz (Author, Member) commented Nov 5, 2016

Yes, I actually wanted to implement something like this.

How about treating "net/maxinflighttcppings" <= 0 as disabling the feature?

@Kissaki (Member) commented Nov 6, 2016

Sure, why not.
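
As a sketch, the agreed-upon guard could look like this inside
ServerHandler::sendPing() (names taken from the diff hunks in this PR;
the exact form in the final patch may differ):

// Treat "net/maxinflighttcppings" <= 0 as "feature disabled": only
// enforce the limit when a positive maximum is configured.
if (g.s.iMaxInFlightTCPPings > 0 && iInFlightTCPPings >= g.s.iMaxInFlightTCPPings) {
    serverConnectionClosed(QAbstractSocket::UnknownSocketError, tr("Server is not responding to TCP pings"));
    return;
}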

@@ -454,6 +462,8 @@ void ServerHandler::message(unsigned int msgType, const QByteArray &qbaMsg) {
ConnectionPtr connection(cConnection);
if (!connection) return;

iInFlightTCPPings -= 1;

@Kissaki (Member) commented Nov 5, 2016

I don’t really like that we handle every ping message here in message, but do not increase the counter in the counterpart sendMessage. But the current code does not seem to be symmetric at all, so that’s just how it is.

@mkrautz (Author, Member) commented Nov 5, 2016

Well, pings are restricted to ServerHandler: they're only ever sent via sendPing(), and the message handler lives in ServerHandler, (as opposed to Messages/MainWindow). So I think the current structure is OK.

@Kissaki (Member) commented Nov 6, 2016

Resetting rather than decrementing the counter (as discussed in IRC) would make this "asymmetric decrement" less dangerous/problematic.
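
In code, the reset variant discussed here would replace the decrement
in the message handler with something like (sketch only):

// Any ping response proves the connection is alive, so clear the
// whole in-flight count rather than decrementing it.
iInFlightTCPPings = 0;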

@@ -392,6 +393,11 @@ void ServerHandler::sendPing() {
if (!connection)
return;

if (iInFlightTCPPings >= g.s.iMaxInFlightTCPPings) {
serverConnectionClosed(QAbstractSocket::UnknownSocketError, tr("Server is not responding to TCP pings"));

@Kissaki (Member) commented Nov 5, 2016

So we don’t close the QSslSocket via connection->disconnect(true) because we expect it to be dead already/nothing to clean up?
If we as a client are not aware of the dropped TCP connection, shouldn’t we abort the QSslSocket?

That would also emit the closed signal - in a cleaner way (responsibility of the connection to emit the closed signal).

@mkrautz (Author, Member) commented Nov 5, 2016

What I wanted to achieve here was to disconnect with a "reason".

Without a reason, the auto-reconnect code will not reconnect. (It treats empty reason and UnknownSocketError as "forced disconnect", I guess).

Also, the reason is nice to have in the disconnection message.

I initially had the same code here as is used for timeout-during-connect (https://github.com/mkrautz/mumble/blob/e5da2ba6daa7c77823c2a47adb6476c584992d0f/src/mumble/ServerHandler.cpp#L525-L528). However, that turned out to not print any text, nor did it try to auto-reconnect.

What the current code does is pretty much equivalent to what ServerHandler::disconnect() does: it calls exit(0), which causes the connection to be torn down: https://github.com/mkrautz/mumble/blob/e5da2ba6daa7c77823c2a47adb6476c584992d0f/src/mumble/ServerHandler.cpp#L315-L340

Would you be happier if I introduced a disconnectWithReason(QString reason), or added a QString reason parameter to ServerHandler's disconnect method?

@mkrautz (Author, Member) commented Nov 5, 2016

I'd just like to add:

Calling ServerHandler::disconnect() sends a custom event, which is handled at https://github.com/mkrautz/mumble/blob/e5da2ba6daa7c77823c2a47adb6476c584992d0f/src/mumble/ServerHandler.cpp#L133

The disconnect method simply calls exit(0) to terminate the ServerHandler thread.

I'm happy to instead add a reason parameter to disconnect (or a disconnectWithReason method) -- as mentioned above.

That was the route I was taking until I found existing code that just called the serverConnectionClosed slot.
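
For reference, the disconnectWithReason() variant floated above might
look roughly like this (hypothetical sketch; no such method exists in
the codebase, and pairing the reason with UnknownSocketError is only
an assumption based on the existing call):

// Hypothetical: tear down the connection while surfacing a reason,
// reusing the existing closed-connection path so the message is shown
// and the auto-reconnect logic does not treat it as a forced disconnect.
void ServerHandler::disconnectWithReason(const QString &reason) {
    serverConnectionClosed(QAbstractSocket::UnknownSocketError, reason);
}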

@Kissaki (Member) commented Nov 6, 2016

I think we should remove the disconnect method and custom event exit logic.
For the one case it is used (connected-to server has no certificate) we don’t have to close Mumble, but just the Connection.

The Connection signal connectionClosed already has a reason parameter, which it passes to/is connected to the slot serverConnectionClosed.
We could thus add the (optional) reason parameter to the disconnectSocket method for our use-case - disconnecting the (no-longer-existent) connection with a reason.
I don’t get the check in disconnectSocket for emitting the connectionClosed though. Alternatively we could add a new method Connection::disconnect(reason) (or similar).


My thought process/details. Nvm if the above is enough.

serverConnectionClosed is a slot. Thus, it should be called (only) as a reaction to signals/events.

In serverConnectionTimeoutOnConnect it is called as a followup event/slot.

Just from the wording, without having checked the code, adding a parameter to the disconnect method feels better. disconnect is a verb, an "action method" rather than a slot/event.

Looking at the code, maybe we want to call disconnect/abort on Connection instead? Which would emit connectionClosed, which is connected to serverConnectionClosed.
Of course, we would have to pass the reason through all of that. The connectionClosed signal already has/uses that reason parameter.

The ServerHandler::disconnect() method is only called in ServerHandler::serverConnectionConnected(), with a valid Connection object in place. So I guess the ServerHandler::disconnect() should be removed. We can close the connection on the Connection object - should be its responsibility.

@mkrautz (Author, Member) commented Nov 6, 2016

[...] and custom event exit logic.
For the one case it is used (connected-to server has no certificate) we don’t have to close Mumble, but just the Connection.

Hmm? This isn't ::exit(0), it's QThread::exit(), which makes the ServerHandler thread terminate.

@Kissaki (Member) commented Nov 5, 2016

At times, I've observed it taking up
to 30 minutes until my client realized that the connection wasn't
there anymore.

wtf, that sucks. Especially when it is the client side OS that drops the connection/changes IP. Not sure what to make of that :-/

@bmwiedemann commented Nov 6, 2016

When thinking about the design - what would happen on a lossy connection where every 100th ping got lost - wouldn't that cause an auto-reconnect after the 200th (aka 2nd lost) ping, even though the connection is alive and there is a constant stream of pings arriving? IMHO what you really want to achieve is to have the last received ping not be more than X seconds old. Or have and use something like the sequence numbers in ICMP echo requests/responses.

@mkrautz (Member, Author) commented Nov 6, 2016

@bmwiedemann It's worth noting that the pings in question are simply messages in a TCP stream, so a "lost" ping would delay the stream, not really get lost. If it got truly "lost", that would suggest a server bug.

(The Murmur behavior for responding to a TCP ping is to respond ASAP: https://github.com/mumble-voip/mumble/blob/master/src/murmur/Messages.cpp#L1429)

But I think what should probably happen here is that the "max in-flight pings" counter is reset on every ping that's received, not just decremented.

That way, the solution is "X consecutive pings without an answer".

While the end result would probably be the same for a TCP socket that's actually dead, perhaps it's more readable and easier to reason about.

@bmwiedemann commented Nov 6, 2016

You are right, I forgot about the TCP retransmits making sure that lost pings eventually arrive. So it should be fine either way.

ServerHandler: reconnect to the server if it is not responding to TCP pings

This is implemented via a 'max in-flight TCP pings' counter, and
a rule for how many in-flight TCP pings there can be at one
point in time. (The default being 2, which means a client will
disconnect from a server if it takes the server more than 10
seconds to reply to a single ping message, given that Mumble
sends ping messages every 5 seconds.)

The idea is simple: ServerHandler keeps a counter of the number
of TCP pings it has sent, but has not yet received a response for.

When ServerHandler sends a ping, it increments the counter.

Once a pong or response for the ping has been received, it
resets the counter to 0.

But most importantly, ServerHandler now checks the number
of in-flight TCP pings before it sends a new ping message.
If the counter exceeds the maximum number of allowed in-flight
TCP pings, it disconnects from the server, allowing the client
to try to re-establish the connection.

One scenario where this is useful is IPv6 privacy addresses.
Privacy addresses are temporary, often with a default expiry time
of 4 hours. After this time, the address is removed from the system,
and TCP connections using that address will be terminated.
(See for example #1377)

One would naturally expect it to be possible to observe the fact
that the TCP connection has been terminated, but that doesn't seem
to be what's happening. At times, I've observed it taking up
to 30 minutes until my client realized that the connection wasn't
there anymore. (However, the server that I was connected to
found out immediately, and disconnected me.)

Because of this behavior, this TCP ping check was implemented.

@mkrautz mkrautz force-pushed the mkrautz:max-in-flight-tcp-pings branch from e5da2ba to 9fc379a Nov 6, 2016

@mkrautz (Member, Author) commented Nov 6, 2016

Made the discussed changes, but I still call serverConnectionClosed to terminate the connection.
If possible, I'd like to look at cleaning up the various ways of tearing down ServerHandler in a separate PR.

@mkrautz (Member, Author) commented Nov 6, 2016

@Kissaki PTAL

@mkrautz (Member, Author) commented Nov 6, 2016

Also, I think that we should be careful with the current logic as @Kissaki mentions.

I had Mumble open in a VM, trying to connect to an IP that no longer hosted a server.
I got the usual errors for a while. I don't remember what it was: unable to connect?

Then, after 5-6 of them, I began getting "unable to send TCP pings to the server".
I guess we start pinging before we should?

@Kissaki (Member) commented Nov 7, 2016

I guess we start pinging before we should?

ServerHandler::run() sets up the recurring ping timer and immediately starts it. It does so after setting up the socket and initiating the connect (qtsSock->connectToHostEncrypted(qsHostName, usPort)), but without any adequate handling of no-success cases. The timer is local to the run() function; it is never paused on connection issues.
(Not sure where the (old) reconnect code is though.)

So yes, we do start pinging and continue to ping in inadequate cases. We should fix that (in a separate PR).

I created #2627 for further analysis/work.
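
For context, the shape of run() being described is roughly this
(heavily simplified sketch, not the verbatim source; connection
handling, signal wiring, and cleanup are omitted):

void ServerHandler::run() {
    QSslSocket *qtsSock = new QSslSocket();

    // Initiate the TLS connect. This call returns immediately, long
    // before the connection is established (or fails).
    qtsSock->connectToHostEncrypted(qsHostName, usPort);

    // The recurring ping timer starts right away and is never paused
    // on connection failure, so sendPing() keeps firing even if the
    // connect above never succeeds.
    QTimer *ticker = new QTimer();
    connect(ticker, SIGNAL(timeout()), this, SLOT(sendPing()));
    ticker->start(5000); // Mumble pings every ~5 seconds

    exec(); // run the thread's event loop

    delete ticker;
    delete qtsSock;
}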

@Kissaki (Member) approved these changes Nov 7, 2016

LGTM
