
Presence state transitions incorrectly when using a sync worker. #16088

Open
realtyem opened this issue Aug 9, 2023 · 1 comment
Labels
  • O-Occasional: Affects or can be seen by some users regularly or most users rarely.
  • S-Minor: Blocks non-critical functionality, workarounds exist.
  • T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@realtyem
Contributor

realtyem commented Aug 9, 2023

If presence is enabled, and you have a worker designated to handle the functions of /sync, there is a rapid flickering of various states caused by timeouts and TCP replication pings.

Metric Screenshots (screenshots omitted; only the image captions survive):

  • presence-offline-timeout-to-explict-unavailable-setting-online-in-between
  • presence-offline-timeout-to-explict-unavailable-rapid-change-due-to-sync-or-active-not-updated

Just some examples. One is particularly troublesome.

A little bit of context before we dive right in:
Quite a few clients seem to have their /sync?timeout= set to 30 seconds, including Element.
Synapse has its SYNC_ONLINE_TIMEOUT set to the same:

# How long to wait until a new /events or /sync request before assuming
# the client has gone.
SYNC_ONLINE_TIMEOUT = 30 * 1000

There seem to be two conditions that lead to this happening, but there may be others.

Condition 1

Here (specifically at Line 451), the condition that decides whether to send a USER_SYNC TCP replication ping from the Sync Worker to the Presence Writer essentially checks whether the entry popped from users_going_offline is None. On the very first /sync it is, so the ping is sent; all subsequent /syncs will not send this ping.

def mark_as_coming_online(self, user_id: str) -> None:
    """A user has started syncing. Send a UserSync to the presence writer,
    unless they had recently stopped syncing.
    """
    going_offline = self.users_going_offline.pop(user_id, None)
    if not going_offline:
        # Safe to skip because we haven't yet told the presence writer they
        # were offline
        self.send_user_sync(user_id, True, self.clock.time_msec())

That ping ends up in update_external_syncs_row() on the Presence Writer. This function is primarily used to update last_user_sync_ts (we will revisit it in a moment, in Condition 2). Updating the timestamp on each /sync is important, as it is one of the indicators of whether a client has disappeared and should be marked offline. The check that last_user_sync_ts is more than SYNC_ONLINE_TIMEOUT in the past lives in handle_timeout and looks like this:

# If there are have been no sync for a while (and none ongoing),
# set presence to offline
if user_id not in syncing_user_ids:
    # If the user has done something recently but hasn't synced,
    # don't set them as offline.
    sync_or_active = max(state.last_user_sync_ts, state.last_active_ts)
    if now - sync_or_active > SYNC_ONLINE_TIMEOUT:
        state = state.copy_and_replace(state=PresenceState.OFFLINE)
        changed = True

So, if the Presence Writer has not received another ping saying the user is still syncing, the call to handle_timeout will mark the client as offline. On the next /sync, this condition will be updated again to whatever new state is requested. I believe the original intent was that only one USER_SYNC should ever be sent, with another sent only when the client has actually disappeared.
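To make the timing concrete, here is a minimal sketch of the Condition 1 timeline; the timestamps are made-up illustrative values, not measurements:

# Illustrative Condition 1 timeline (all values in milliseconds, assumed).
SYNC_ONLINE_TIMEOUT = 30 * 1000

last_user_sync_ts = 0   # set once by the initial USER_SYNC ping
last_active_ts = 0      # no other client activity since then
now = 31 * 1000         # handle_timeout runs ~31s later; no further pings arrived

sync_or_active = max(last_user_sync_ts, last_active_ts)
if now - sync_or_active > SYNC_ONLINE_TIMEOUT:
    # The Presence Writer marks the user offline here, even though the
    # Sync Worker still considers the client to be actively syncing.
    print("user would be marked offline")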

Condition 2

This Condition is loosely based on Condition 1.

  1. The Sync Worker has sent its initial USER_SYNC ping to the Presence Writer, which updates the last_user_sync_ts as it should.
  2. Things happen normally, but then stop and 5 minutes pass.
  3. The client starts sending unavailable as our presence state, so now the presence state is marked unavailable and is propagated to the rest of the system.
  4. The next call to /sync has nothing happen for 30 seconds, so the Presence Writer starts the timing-out process outlined above.
  5. The next call to /sync has unavailable still in set_presence, and a new ping is sent and lands in
    update_external_syncs_row() at:

    if is_syncing and user_id not in process_presence:
        if prev_state.state == PresenceState.OFFLINE:
            updates.append(
                prev_state.copy_and_replace(
                    state=PresenceState.ONLINE,
                    last_active_ts=sync_time_msec,
                    last_user_sync_ts=sync_time_msec,
                )
            )

which realizes the old state was offline and then sets the new presence state to online. (Incidentally, this same logic is duplicated in the user_syncing() call in PresenceHandler and I'm not completely convinced it isn't dead code. But that is a separate issue I still have to properly explore.)
  6. Another /sync occurs, thus setting the client back to unavailable.

This cycle then repeats every time a /sync is allowed to timeout at the 30 second mark.
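To show the flip in isolation, here is a minimal sketch of the Condition 2 cycle, using a simplified stand-in for Synapse's presence state (the dataclass and function names below are illustrative, not the real classes):

# Minimal sketch of the Condition 2 flip; FakePresenceState and
# on_user_sync_ping are illustrative stand-ins, not Synapse code.
from dataclasses import dataclass, replace

OFFLINE, ONLINE, UNAVAILABLE = "offline", "online", "unavailable"

@dataclass(frozen=True)
class FakePresenceState:
    state: str
    last_active_ts: int = 0
    last_user_sync_ts: int = 0

def on_user_sync_ping(prev: FakePresenceState, sync_time_msec: int) -> FakePresenceState:
    # Mirrors the quoted update_external_syncs_row() branch: a ping that
    # arrives after the timeout resurrects the user as online...
    if prev.state == OFFLINE:
        return replace(
            prev,
            state=ONLINE,
            last_active_ts=sync_time_msec,
            last_user_sync_ts=sync_time_msec,
        )
    return prev

# ...and the /sync request's own set_presence=unavailable is then applied,
# flipping the user straight back and producing the flicker described above.
state = FakePresenceState(state=OFFLINE)
state = on_user_sync_ping(state, sync_time_msec=31_000)   # -> online
state = replace(state, state=UNAVAILABLE)                  # -> unavailable again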

Now, having said all that, there are more moving parts here than just those mentioned above: looping calls, other timeouts, latency between connections, processing time when updating the presence state in the database, etc. The looping call that sends the USER_SYNC signalling that the sync is done can fire as soon as immediately and as late as UPDATE_SYNCING_USERS_MS (currently 10 seconds) after the sync ends, and it seems to be a particularly bothersome extra mechanism that might be influencing this. (On line 1914 above, where the condition says if user_id not in syncing_user_ids:, this looping call is the process that removes the user_id from syncing_user_ids.)
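As a rough worked example of the extra slack that looping call can add (purely illustrative arithmetic, using the current constants):

# Purely illustrative: while the user is still in syncing_user_ids, the
# quoted handle_timeout branch cannot mark them offline, so in the worst
# case roughly UPDATE_SYNCING_USERS_MS piles on top of SYNC_ONLINE_TIMEOUT.
SYNC_ONLINE_TIMEOUT = 30 * 1000
UPDATE_SYNCING_USERS_MS = 10 * 1000

worst_case_window_ms = SYNC_ONLINE_TIMEOUT + UPDATE_SYNCING_USERS_MS
print(worst_case_window_ms)  # 40000 ms in this illustration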

I'm of two minds about how to deal with this, and I only explored one of them.

  1. Enforce the behavior of only sending the USER_SYNC on the initial connection, and then again when the connection is gone. This would require the Sync Worker to keep track of when last_user_sync_ts was last updated, and to send either that update or just the offline notification to the Presence Writer when it happens.
  2. Change the pinging to act as a keep-alive system. I would recommend a new structure to track when the last ping was sent, and only send another one once the previous ping is older than SYNC_ONLINE_TIMEOUT minus a few seconds (to allow for latency and processing time); see the sketch below.
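For option 2, a minimal sketch of what the keep-alive throttle might look like on the Sync Worker side; the last_ping_sent map, the maybe_send_user_sync helper, and the margin value are hypothetical, not existing Synapse code:

# Hypothetical keep-alive throttle for USER_SYNC pings (option 2 sketch only).
SYNC_ONLINE_TIMEOUT = 30 * 1000   # matches Synapse's constant
KEEPALIVE_MARGIN_MS = 5 * 1000    # a few seconds of slack for latency/processing

last_ping_sent = {}  # user_id -> time (ms) the last USER_SYNC was sent

def maybe_send_user_sync(user_id, now_ms, send_user_sync):
    """Re-send USER_SYNC only when the previous ping is close to expiring."""
    last = last_ping_sent.get(user_id, 0)
    if now_ms - last >= SYNC_ONLINE_TIMEOUT - KEEPALIVE_MARGIN_MS:
        send_user_sync(user_id, True, now_ms)
        last_ping_sent[user_id] = now_ms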
@erikjohnston added the S-Minor, T-Defect, and O-Occasional labels on Aug 14, 2023
@realtyem
Contributor Author

Now that the multi-device support for presence has landed, I had a chance to revisit this. Largely, this is still a concern but some of the details have changed:

  • set_state() now sets/updates last_user_sync_ts, so most of the point of sending a TCP replication USER_SYNC has gone away:

    • USER_SYNC start: is now handled by set_state() and can be removed.
    • USER_SYNC end: is used to update the sync timeout one last time, thus giving Synapse an additional 30 seconds (SYNC_ONLINE_TIMEOUT) before timing the user out to offline. Arguably, this can be removed as well (see below, in the update to Condition 1, for another option).
    • The remainder seems to be about adding the (user_id, device_id) to a structure used for deciding whether a user has timed out (external_process_last_updated_ms), and could be migrated/factored in elsewhere, most likely into the is_sync part of set_state().
  • Condition 1. If the client is syncing, but nothing arrives before the timeout=30000 expires so that nothing is returned, Condition 1 can still be hit (where a client is marked offline incorrectly, only to be brought right back).

    • Perhaps extracting the timeout used for the call to /sync and using that as the actual SYNC_ONLINE_TIMEOUT, instead of a hardwired 30 seconds, would be appropriate? Then if the client decides in the future that 5 minutes needs to be the new timeout, it would be automagically handled. It could be a simple formula (see the sketch after this list):
      now > (last_user_sync_ts + client_req_sync_timeout + SYNC_ONLINE_TIMEOUT) then user=offline, or something like that.
  • Condition 2. For the moment, this hasn't shown itself occurring again (even when running back in time to Element 1.11.37, where set_presence is set on /sync). This is harder to hit; I'll be watching for it.
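A minimal sketch of that check, assuming a hypothetical client_req_sync_timeout value captured from the /sync request's timeout= parameter:

# Sketch of the proposed timeout check; client_req_sync_timeout is a
# hypothetical per-user value recorded from the /sync request's timeout=.
SYNC_ONLINE_TIMEOUT = 30 * 1000

def should_mark_offline(
    now: int, last_user_sync_ts: int, client_req_sync_timeout: int
) -> bool:
    # Only time the user out once the client's own long-poll window plus the
    # usual grace period has elapsed since the last recorded sync.
    return now > last_user_sync_ts + client_req_sync_timeout + SYNC_ONLINE_TIMEOUT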
