This repository has been archived by the owner on Apr 26, 2024. It is now read-only.
Presence state transitions incorrectly when using a sync worker. #16088
Labels
O-Occasional
Affects or can be seen by some users regularly or most users rarely
S-Minor
Blocks non-critical functionality, workarounds exist.
T-Defect
Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
If presence is enabled, and you have a worker designated to handle the functions of `/sync`, there is a rapid flickering of various states caused by timeouts and TCP replication pings.
Metric Screenshots
Just some examples. One is particularly troublesome.
A little bit of context before we dive right in:
Quite a few clients seem to have their `/sync?timeout=` set to 30 seconds, including Element. Synapse has its `SYNC_ONLINE_TIMEOUT` set to the same:
synapse/synapse/handlers/presence.py
Lines 108 to 111 in 0328b56
There seem to be two conditions that lead to this happening, but there may be others.
Condition 1
Here (specifically at Line 451), the condition to send a `USER_SYNC` TCP replication ping from the Sync Worker to the Presence Writer is checked for being `None`, essentially. On the very first `/sync` this will be `True` and send the ping. All other subsequent `/sync`s will not send this ping.
synapse/synapse/handlers/presence.py
Lines 446 to 455 in 0328b56
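
As a rough illustration of Condition 1, the "ping only on the first `/sync`" behaviour can be modelled as below. This is a toy model, not Synapse's code: `SyncWorkerModel`, `on_sync`, `sync_counts`, and `sent_pings` are hypothetical names, and the real handler tracks more state than this.

```python
class SyncWorkerModel:
    """Toy stand-in for the sync worker's presence bookkeeping (hypothetical)."""

    def __init__(self):
        # user_id -> number of /sync requests seen; a missing key means "never seen".
        self.sync_counts = {}
        # Pings that would go to the Presence Writer, as (user_id, is_syncing).
        self.sent_pings = []

    def on_sync(self, user_id):
        previous = self.sync_counts.get(user_id)  # None on the very first /sync
        self.sync_counts[user_id] = (previous or 0) + 1
        if previous is None:
            # Only the first /sync for this user produces a USER_SYNC ping.
            self.sent_pings.append((user_id, True))


worker = SyncWorkerModel()
for _ in range(3):
    worker.on_sync("@alice:example.org")
print(worker.sent_pings)
```

In this model three syncs yield a single ping, so nothing refreshes the writer's timestamp after the first request.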
That ping ends up in `update_external_syncs_row()` on the Presence Writer. This function is primarily used to update the `last_user_sync_ts` (but we will be revisiting it in a moment in Condition 2). Updating the timestamp on each `/sync` is important, as a stale timestamp is one of the indicators that a client has disappeared and should be marked `offline`. That check, for `last_user_sync_ts` being more than `SYNC_ONLINE_TIMEOUT` in the past, is in `handle_timeout` and looks like this:
synapse/synapse/handlers/presence.py
Lines 1912 to 1920 in 0328b56
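
A simplified sketch of the kind of check described above, assuming only the two inputs this issue mentions (`last_user_sync_ts` and `syncing_user_ids`). The function name `apply_sync_timeout` and the plain string states are hypothetical, and the real `handle_timeout` may take more into account than this.

```python
SYNC_ONLINE_TIMEOUT = 30 * 1000  # milliseconds; the 30 seconds quoted in this issue


def apply_sync_timeout(state, last_user_sync_ts, now, user_id, syncing_user_ids):
    """Return the presence state after the sync-timeout check (hypothetical helper)."""
    # A user who is still registered as syncing is never timed out here.
    if user_id not in syncing_user_ids:
        # No ping for longer than the timeout: treat the client as gone.
        if now - last_user_sync_ts > SYNC_ONLINE_TIMEOUT:
            return "offline"
    return state
```

With a 30 second long-poll and a 30 second timeout, a single ping at the start of a `/sync` leaves essentially no slack before this check can fire.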
So, if on the Presence Writer we have not received another ping saying the user is still syncing, the call to `handle_timeout` will mark a client as `offline`. On the next `/sync`, this condition will be updated again to whatever the new state is requested to be. I believe the original intent behind this was that only one `USER_SYNC` should ever be sent, and then another would be sent when the client has actually disappeared.
Condition 2
This Condition is loosely based on Condition 1.
1. The first `/sync` sends a `USER_SYNC` ping to the Presence Writer, which updates the `last_user_sync_ts` as it should.
2. We request `unavailable` as our presence state, so now the presence state is marked `unavailable` and is propagated to the rest of the system.
3. The `/sync` has nothing happen for 30 seconds, and the Presence Writer starts the timing out process outlined above.
4. The next `/sync` still has `unavailable` in `set_presence`, and a new ping is sent and lands in `update_external_syncs_row()` at:
synapse/synapse/handlers/presence.py
Lines 1094 to 1102 in 0328b56
which realizes the old state was `offline` and then sets the new presence state to `online`. (Incidentally, this same logic is duplicated in the `user_syncing()` call in `PresenceHandler` and I'm not completely convinced it isn't dead code. But that is a separate issue I still have to properly explore.)
5. Another `/sync` occurs, thus setting the client back to `unavailable`.
This cycle then repeats every time a `/sync` is allowed to time out at the 30 second mark.
Now, having said all that, there are more moving parts here than just those mentioned: looping calls, other timeouts, latency between connections, processing time when updating the presence state in the database, etc. The looping call that sends the `USER_SYNC` signaling the sync is done, which can fire as soon as immediately and as late as `UPDATE_SYNCING_USERS_MS` (currently 10 seconds) afterwards, seems to be a particularly bothersome extra mechanism that might be influencing this. (On line 1914 above, the condition says `if user_id not in syncing_user_ids:`; this is the process that takes the `user_id` out of `syncing_user_ids`.)
I'm of two minds about how to deal with this, and I only explored one of them.
1. Send a `USER_SYNC` on initial connection, and then again when the connection is gone. This will require the Sync Worker to keep track of when the `last_user_sync_ts` is updated and then send either that update or just the `offline` to the Presence Writer when it happens.
2. Re-send the ping at an interval of `SYNC_ONLINE_TIMEOUT` minus a few seconds, to allow for latency and processing time.
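
To make the "`SYNC_ONLINE_TIMEOUT` minus a few seconds" idea concrete, here is a minimal sketch of a refresh rule for a user who keeps syncing. It is not taken from Synapse: `REFRESH_MARGIN_MS`, `should_resend_user_sync`, and `ping_times` are hypothetical names, and the 5 second margin is an assumed value chosen only for illustration.

```python
SYNC_ONLINE_TIMEOUT_MS = 30_000  # the 30 seconds quoted in this issue
REFRESH_MARGIN_MS = 5_000        # hypothetical allowance for latency and processing

REFRESH_INTERVAL_MS = SYNC_ONLINE_TIMEOUT_MS - REFRESH_MARGIN_MS


def should_resend_user_sync(last_ping_ts, now):
    """Return True when a still-syncing user needs a fresh ping."""
    if last_ping_ts is None:
        return True  # nothing sent yet, so the first ping always goes out
    return now - last_ping_ts >= REFRESH_INTERVAL_MS


def ping_times(duration_ms, tick_ms=1_000):
    """Timestamps (ms) at which pings would be sent during a continuous sync."""
    sent = []
    last = None
    for now in range(0, duration_ms + 1, tick_ms):
        if should_resend_user_sync(last, now):
            sent.append(now)
            last = now
    return sent


print(ping_times(60_000))
```

Under these assumptions the gap between pings stays below the timeout, so a timeout check keyed on the last ping would not fire while the client is still connected.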