Skip to content

Commit

Permalink
jsonrpc: Avoid disconnecting prematurely due to long poll intervals.
Browse files Browse the repository at this point in the history
Open vSwitch has a few different jsonrpc-based protocols that depend on
jsonrpc_session to make sure that the connection is up and working.
In turn, jsonrpc_session uses the "reconnect" state machine to send
probes if nothing is received.  This works fine in normal circumstances.
In unusual circumstances, though, it can happen that the program is
busy and doesn't even try to receive anything for a long time.  Then the
timer can time out without a good reason; if it had tried to receive
something, it would have.

There's a solution that the clients of jsonrpc_session could adopt.
Instead of first calling jsonrpc_session_run(), which is what calls into
"reconnect" to deal with timing out, and then calling into
jsonrpc_session_recv(), which is what tries to receive something, they
could use the opposite order.  That would make sure that the timeout
was always based on a recent attempt to receive something.  Great.

The actual code in OVS that uses jsonrpc_session, though, tends to use
the opposite order, and there are enough users and this is a subtle
enough issue that it could get flipped back around even if we fixed it
now.  So this commit takes a different approach.  Instead of fixing
this in the users of jsonrpc_session, we fix it in the users of
reconnect: make them tell when they've tried to receive something (or
disable this particular feature).

This commit fixes the problem that way.  It's kind of hard to reproduce
but I'm pretty sure that I've seen it a number of times in testing.

Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
  • Loading branch information
blp committed Dec 21, 2020
1 parent f16b722 commit 77ca5ed
Show file tree
Hide file tree
Showing 4 changed files with 29 additions and 3 deletions.
5 changes: 4 additions & 1 deletion lib/jsonrpc.c
Expand Up @@ -1155,13 +1155,16 @@ jsonrpc_session_recv(struct jsonrpc_session *s)

received_bytes = jsonrpc_get_received_bytes(s->rpc);
jsonrpc_recv(s->rpc, &msg);

long long int now = time_msec();
reconnect_receive_attempted(s->reconnect, now);
if (received_bytes != jsonrpc_get_received_bytes(s->rpc)) {
/* Data was successfully received.
*
* Previously we only counted receiving a full message as activity,
* but with large messages or a slow connection that policy could
* time out the session mid-message. */
reconnect_activity(s->reconnect, time_msec());
reconnect_activity(s->reconnect, now);
}

if (msg) {
Expand Down
25 changes: 23 additions & 2 deletions lib/reconnect.c
Expand Up @@ -61,6 +61,7 @@ struct reconnect {
long long int last_activity;
long long int last_connected;
long long int last_disconnected;
long long int last_receive_attempt;
unsigned int max_tries;
unsigned int backoff_free_tries;

Expand Down Expand Up @@ -109,6 +110,7 @@ reconnect_create(long long int now)
fsm->last_activity = now;
fsm->last_connected = LLONG_MAX;
fsm->last_disconnected = LLONG_MAX;
fsm->last_receive_attempt = now;
fsm->max_tries = UINT_MAX;
fsm->creation_time = now;

Expand Down Expand Up @@ -501,6 +503,19 @@ reconnect_activity(struct reconnect *fsm, long long int now)
fsm->last_activity = now;
}

/* Tell 'fsm' that some attempt to receive data on the connection was made at
* 'now'. The FSM only allows probe interval timer to expire when some attempt
* to receive data on the connection was received after the time when it should
* have expired. This helps in the case where there's a long delay in the poll
* loop and then reconnect_run() executes before the code to try to receive
* anything from the remote runs. (To disable this feature, just call
* reconnect_receive_attempted(fsm, LLONG_MAX).) */
void
reconnect_receive_attempted(struct reconnect *fsm, long long int now)
{
fsm->last_receive_attempt = now;
}

static void
reconnect_transition__(struct reconnect *fsm, long long int now,
enum state state)
Expand Down Expand Up @@ -541,13 +556,19 @@ reconnect_deadline__(const struct reconnect *fsm)
case S_ACTIVE:
if (fsm->probe_interval) {
long long int base = MAX(fsm->last_activity, fsm->state_entered);
return base + fsm->probe_interval;
long long int expiration = base + fsm->probe_interval;
if (fsm->last_receive_attempt >= expiration) {
return expiration;
}
}
return LLONG_MAX;

case S_IDLE:
if (fsm->probe_interval) {
return fsm->state_entered + fsm->probe_interval;
long long int expiration = fsm->state_entered + fsm->probe_interval;
if (fsm->last_receive_attempt >= expiration) {
return expiration;
}
}
return LLONG_MAX;

Expand Down
1 change: 1 addition & 0 deletions lib/reconnect.h
Expand Up @@ -83,6 +83,7 @@ void reconnect_connected(struct reconnect *, long long int now);
void reconnect_connect_failed(struct reconnect *, long long int now,
int error);
void reconnect_activity(struct reconnect *, long long int now);
void reconnect_receive_attempted(struct reconnect *, long long int now);

enum reconnect_action {
RECONNECT_CONNECT = 1,
Expand Down
1 change: 1 addition & 0 deletions tests/test-reconnect.c
Expand Up @@ -48,6 +48,7 @@ test_reconnect_main(int argc OVS_UNUSED, char *argv[] OVS_UNUSED)

now = 1000;
reconnect = reconnect_create(now);
reconnect_receive_attempted(reconnect, LLONG_MAX);
reconnect_set_name(reconnect, "remote");
reconnect_get_stats(reconnect, now, &prev);
printf("### t=%d ###\n", now);
Expand Down

0 comments on commit 77ca5ed

Please sign in to comment.