This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Presence fix for idle mode #15989

Closed
1 change: 1 addition & 0 deletions changelog.d/15989.bugfix
@@ -0,0 +1 @@
Allow presence to achieve idle state.
61 changes: 37 additions & 24 deletions synapse/handlers/presence.py
@@ -19,6 +19,7 @@
The methods that define policy are:
- PresenceHandler._update_states
- PresenceHandler._handle_timeouts
- PresenceHandler.set_state
- should_notify
"""
import abc
@@ -444,20 +445,18 @@ def send_user_sync(self, user_id: str, is_syncing: bool, last_sync_ms: int) -> N
)

def mark_as_coming_online(self, user_id: str) -> None:
"""A user has started syncing. Send a UserSync to the presence writer,
unless they had recently stopped syncing.
"""
going_offline = self.users_going_offline.pop(user_id, None)
if not going_offline:
# Safe to skip because we haven't yet told the presence writer they
# were offline
self.send_user_sync(user_id, True, self.clock.time_msec())
"""A user has started syncing. Send a UserSync to the presence writer."""
# Need to remove the user_id from users_going_offline or the looping call will
# tell the presence writer that the user went offline.
self.users_going_offline.pop(user_id, None)
self.send_user_sync(user_id, True, self.clock.time_msec())
Member:
I'm not quite following why we remove the if-statement here. Can you explain a bit more what this is doing?

Contributor Author (@realtyem), Aug 2, 2023:

I mentioned that above in Number 2 under Three-fold fix.
Some context to keep in mind before I begin the long version:

  • This only applies to a worker deployment of Synapse where there is (at the very least) either a sync worker or a presence writer; it does not apply to a monolith deployment.
  • Syncs can happen in rapid or in slow succession, depending on how many other users the homeserver is aware of plus any room updates coming along.
  • There is a looping call every 10 seconds that will mark a user as 'offline' if they are in users_going_offline and at least 10 seconds have passed. These timers are not synchronized in any way, so a theoretical 19-second gap could occur. The looping call uses the companion to send_user_sync(), called send_stop_syncing(), which makes an explicit call to the presence writer to mark them 'offline'.
  • This is tied to the context manager that wraps the call to the sync handler.

Now

  1. At the _end() of the sync context manager, it used to add user_id as a key to users_going_offline with a value of the current time in milliseconds.
  2. On the next sync, user_id is pop()ped off of users_going_offline and checked for None (or, more specifically, checked that there was not a value), and only then would it send_user_sync(). This only occurs on the very first sync of a recent-ish session; otherwise the value is not None and send_user_sync() is not called.
  3. So, we are mid-sync and the presence writer has not been informed we are still around. Finish the context manager, and repeat however many times it takes to get to about 10 seconds (specifically, the next time the looping call comes around to call send_stop_syncing() and mark them offline).
  4. 10 seconds have passed, and the context manager is still doing its thing and knows the user is still online (obviously, they are still mid-sync after all). Over on the presence writer, however, there has been no indication in 10 seconds that the user is still online: no further USER_SYNC has come over TCP replication, and remember that set_state() no longer bumps last_active_ts on a regular basis. The machinery kicks into gear to mark the user as 'offline', which then gets sent over replication back to the sync handler side.
  5. At this point, the presence writer thinks the user is 'offline' and the sync handler (well, its associated WorkerPresenceHandler, actually) thinks the user is 'online'. The call from the presence writer will overwrite what the context manager thinks at this point. Still no update to last_active_ts, by the way.
  6. On the next sync, the context manager sees they were 'offline' last and puts the user back to 'online', beginning the cycle again. You can see this in the metrics under the spoiler arrow with the words "One without Two resulted in this bad dream (every 35 seconds)".

Now, you are probably thinking, "Well, let's just remove the not in the conditional and check for an actual time." I considered that, but then the presence writer doesn't get its first notification that the user is syncing, and that felt like undefined behavior. Honestly, it might work out (and it's a place I'm planning on revisiting in the future as a bit of a refactor; there are more things to touch here). I will test this if you wish; my instinct says that as long as the second sync of a session is within 10 seconds, everything will be fine.

And, to be fair, fully fixing everything possible wasn't in the scope of this PR; I wanted 'unavailable' to work as advertised. But this was affecting the behavior in a negative way and wasn't going to be allowed to get away: a traditional 'bug' that was fully exposed by last_active_ts no longer being updated on every single sync.
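For readers following along, here is a minimal, illustrative sketch of the worker-side bookkeeping described above (not the actual Synapse implementation; the class name WorkerPresenceSketch is made up, while users_going_offline, send_user_sync(), send_stop_syncing(), the clock.time_msec() helper and the 10-second UPDATE_SYNCING_USERS_MS interval all appear in the code under discussion):

```python
# Illustrative only -- a simplified stand-in for WorkerPresenceHandler, assuming
# a clock with time_msec() and a send_user_sync(user_id, is_syncing, last_sync_ms)
# hook that forwards USER_SYNC to the presence writer.

UPDATE_SYNCING_USERS_MS = 10_000  # the 10-second looping-call interval


class WorkerPresenceSketch:
    def __init__(self, clock, send_user_sync) -> None:
        self.clock = clock
        self.send_user_sync = send_user_sync
        # user_id -> time (ms) at which their last sync finished
        self.users_going_offline: dict[str, int] = {}

    def mark_as_coming_online(self, user_id: str) -> None:
        # With this PR: always clear any pending "going offline" entry and
        # always tell the presence writer the user is syncing. Previously the
        # send only happened when the entry was absent (i.e. the first sync of
        # a session), so a long-running session could go 10+ seconds without
        # any USER_SYNC reaching the presence writer.
        self.users_going_offline.pop(user_id, None)
        self.send_user_sync(user_id, True, self.clock.time_msec())

    def mark_as_going_offline(self, user_id: str) -> None:
        # Called when a sync finishes; the looping call below decides later
        # whether the user really went offline.
        self.users_going_offline[user_id] = self.clock.time_msec()

    def send_stop_syncing(self) -> None:
        # Run every UPDATE_SYNCING_USERS_MS: anyone whose last sync finished
        # more than 10 seconds ago is reported as no longer syncing.
        now = self.clock.time_msec()
        for user_id, last_sync_ms in list(self.users_going_offline.items()):
            if now - last_sync_ms > UPDATE_SYNCING_USERS_MS:
                self.users_going_offline.pop(user_id, None)
                self.send_user_sync(user_id, False, last_sync_ms)
```

The race in point 4 falls out of this shape: if mark_as_coming_online() returns early for an ongoing session, nothing refreshes the presence writer and send_stop_syncing() eventually wins.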


def mark_as_going_offline(self, user_id: str) -> None:
"""A user has stopped syncing. We wait before notifying the presence
writer as its likely they'll come back soon. This allows us to avoid
sending a stopped syncing immediately followed by a started syncing
notification to the presence writer
"""A user has stopped syncing. We allow the timeout(UPDATE_SYNCING_USERS_MS)
to expire and the looping_call to handle notifying the presence writer as its
likely they'll come back soon. This allows us to avoid sending a stopped
syncing immediately followed by a started syncing notification to the
presence writer
"""
self.users_going_offline[user_id] = self.clock.time_msec()

@@ -486,11 +485,12 @@ async def user_syncing(
if prev_state.state != PresenceState.BUSY:
# We set state here but pass ignore_status_msg = True as we don't want to
# cause the status message to be cleared.
- # Note that this causes last_active_ts to be incremented which is not
- # what the spec wants: see comment in the BasePresenceHandler version
- # of this function.
+ # Note that this causes last_active_ts to be incremented only if changing
+ # state to 'online' from 'offline'.
await self.set_state(
-     UserID.from_string(user_id), {"presence": presence_state}, True
+     UserID.from_string(user_id),
+     {"presence": presence_state},
+     ignore_status_msg=True,
)

curr_sync = self._user_to_num_current_syncs.get(user_id, 0)
@@ -1004,13 +1004,12 @@ async def user_syncing(
# If they're busy then they don't stop being busy just by syncing,
# so just update the last sync time.
if prev_state.state != PresenceState.BUSY:
# XXX: We set_state separately here and just update the last_active_ts above
# This keeps the logic as similar as possible between the worker and single
- # process modes. Using set_state will actually cause last_active_ts to be
- # updated always, which is not what the spec calls for, but synapse has done
- # this for... forever, I think.
+ # process modes.
await self.set_state(
-     UserID.from_string(user_id), {"presence": presence_state}, True
+     UserID.from_string(user_id),
+     {"presence": presence_state},
+     ignore_status_msg=True,
)
# Retrieve the new state for the logic below. This should come from the
# in-memory cache.
@@ -1254,9 +1253,12 @@ async def set_state(
if not ignore_status_msg:
new_fields["status_msg"] = status_msg

- if presence == PresenceState.ONLINE or (
-     presence == PresenceState.BUSY and self._busy_presence_enabled
- ):
+ if (
+     prev_state.state == PresenceState.OFFLINE
Member:

I'm failing to follow why this is specifically for offline -> online; naively I would have expected it for any transition from !online -> online.

(The busy presence bit seems like it might have the same bug?)

Contributor Author:

I had a reason, which is entirely escaping me right now (and appears to be missing from my notes). Let me test it out, because if this holds up we can remove the 'hacky' bit in handle_update(). I think it was related to not using set_presence on the /sync query, as it defaults to 'online'. However, that behavior has recently changed (at least for Element; other clients may not have). Give me some time to look into it again.

Member:

OK! The various possible state transitions are rather fuzzy in my head, which is why I'm asking. 😄

Contributor Author:

I went ahead and made those changes. No deal (at least as-is). It will need something else to handle the fact that some clients still set presence via the presence endpoint and not through /sync.

https://github.com/matrix-org/synapse/actions/runs/5745219286

(attached screenshot: presence_temp)

I left my desktop client open (so it should have been syncing in the background, at least until it went to sleep). But then I checked in real quick with my Android mobile, and a few minutes later this nonsense started spamming federation.

+     and presence == PresenceState.ONLINE
+ ) or (presence == PresenceState.BUSY and self._busy_presence_enabled):
+     # By updating last_active_ts in this way, currently_active will be triggered
+     # in handle_update()
new_fields["last_active_ts"] = self.clock.time_msec()

await self._update_states(
@@ -1954,6 +1956,9 @@ def handle_update(
- persist_and_notify: whether to persist and notify people
- federation_ping: whether we should send a ping over federation
"""
# Note: if the previous state was OFFLINE and we are now ONLINE, this is most likely
# caused by a call to the /sync endpoint. If that is the case, last_active_ts would
# have been set in PresenceHandler.set_state()
user_id = new_state.user_id

persist_and_notify = False
@@ -1977,6 +1982,14 @@
obj=user_id,
then=new_state.last_active_ts + LAST_ACTIVE_GRANULARITY,
)
if prev_state.state == PresenceState.UNAVAILABLE:
# If we had idled out, but are still syncing, then new state will look
# like it's online. Verify with last_active_ts, as that is a sign of a
# pro-active event. Override that we are online when we probably aren't.
if now - new_state.last_active_ts > IDLE_TIMER:
new_state = new_state.copy_and_replace(
state=PresenceState.UNAVAILABLE
)
Comment on lines +1985 to +1992
Member:

This feels a bit hacky -- can we avoid trying to update in this case at all, instead of backtracking from saying we're online to unavailable?
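For example, something along these lines (a purely hypothetical, untested sketch rather than a concrete proposal; the helper name keep_idle_if_only_syncing is invented, while PresenceState, UserPresenceState, IDLE_TIMER and last_active_ts are the names used elsewhere in this diff):

```python
# Hypothetical alternative to the override above: drop a sync-driven "online"
# before it reaches handle_update() when the user has already idled out,
# instead of accepting it and rewriting the state back to UNAVAILABLE.
def keep_idle_if_only_syncing(
    prev_state: "UserPresenceState",
    new_state: "UserPresenceState",
    now: int,
) -> "UserPresenceState":
    if (
        prev_state.state == PresenceState.UNAVAILABLE
        and new_state.state == PresenceState.ONLINE
        and now - new_state.last_active_ts > IDLE_TIMER
    ):
        # No pro-active activity since we went idle: keep the idle state
        # rather than bouncing to online and back.
        return prev_state
    return new_state
```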


if new_state.state != PresenceState.OFFLINE:
# User has stopped syncing
57 changes: 57 additions & 0 deletions tests/handlers/test_presence.py
@@ -306,6 +306,37 @@ def test_online_to_idle(self) -> None:
any_order=True,
)

def test_idle_while_syncing_stays_idle(self) -> None:
"""Test that syncing alone doesn't affect unavailable state"""
wheel_timer = Mock()
user_id = "@foo:bar"
now = 5000000
time_went_idle = now - IDLE_TIMER - 1

prev_state = UserPresenceState.default(user_id)
prev_state = prev_state.copy_and_replace(
state=PresenceState.UNAVAILABLE,
last_active_ts=time_went_idle,
currently_active=False,
)

# The key to the test here is that while syncing, we are being told we are
# 'online'. But if the last_active_ts hasn't moved, then we are actually still
# idle.
new_state = prev_state.copy_and_replace(state=PresenceState.ONLINE)

state, persist_and_notify, federation_ping = handle_update(
prev_state, new_state, is_mine=True, wheel_timer=wheel_timer, now=now
)

self.assertFalse(persist_and_notify)
self.assertEqual(prev_state.state, state.state)
self.assertEqual(state.last_federation_update_ts, now)
self.assertEqual(prev_state.state, state.state)
self.assertEqual(prev_state.status_msg, state.status_msg)

self.assertEqual(wheel_timer.insert.call_count, 2)

def test_persisting_presence_updates(self) -> None:
"""Tests that the latest presence state for each user is persisted correctly"""
# Create some test users and presence states for them
@@ -773,6 +804,32 @@ def test_set_presence_from_syncing_keeps_busy(
# we should still be busy
self.assertEqual(state.state, PresenceState.BUSY)

def test_syncing_appropriately_handles_last_active_ts(self) -> None:
"""Test that syncing does not bump last_active_ts when already online"""

user_id = "@test:server"

# Start from a time that is not zero
self.reactor.advance(2)

self._set_presencestate_with_status_msg(user_id, PresenceState.ONLINE, None)

prev_state = self.get_success(
self.presence_handler.get_state(UserID.from_string(user_id))
)

# Poke the reactor so some time passes
self.reactor.advance(2)
self.get_success(
self.presence_handler.user_syncing(user_id, True, PresenceState.ONLINE)
)

state = self.get_success(
self.presence_handler.get_state(UserID.from_string(user_id))
)
# last_active_ts should not have changed as no pro-active event has occurred
self.assertEqual(prev_state.last_active_ts, state.last_active_ts)
Comment on lines +830 to +831
Member:

@erikjohnston and I went through this and we think there's a misunderstanding here -- last_active_ts should be updated for each sync. It seems like not updating this except for the initial offline -> online transition is fixing the bug, at the expense of never updating last_active_ts.

Member:

Reading just that summary, that sounds like a recipe for disaster.

start: /sync (online) => update last_active_ts => new data to send to syncer => goto start

Member (@clokep), Aug 4, 2023:

> start: /sync (online) => update last_active_ts => new data to send to syncer => goto start

My understanding is that updating the last_active_ts should not cause new data to be sent down sync while you're actively syncing.

Member:

Ah yes

https://spec.matrix.org/v1.7/client-server-api/#last-active-ago

> To reduce the number of presence updates sent to clients the server may include a currently_active boolean field when the presence state is online. When true, the server will not send further updates to the last active time until an update is sent to the client with either a) currently_active set to false or b) a presence state other than online. During this period clients must consider the user to be currently active, irrespective of the last active time.

Contributor Author:

To be clear

There is more to do in this area, but will be on a different PR as it may be somewhat contentious.

I wasn't done in this area. This doesn't completely fix last_active_ts being sent inappropriately, just the part that was causing it to affect idle mode (I'm trying to stay inside the scope of the PR; the TCP replication was one of the things masking this and keeping it from working as well, or it would also have been separate).

To fix last_active_ts being updated inappropriately really only means that, if currently_active is True, we don't have to send anything when the only thing to send is an update to last_active_ts. The only time last_active_ts should be sent is when currently_active is False, and that update should flip currently_active to True in the same packet.

State changes are separate from this, and this PR is only supposed to be about making the state change work. I'm sorry this is causing problems. I can close this and separate it all out if you would rather. (But I'm not updating the pictures; those were a pain to do the first time.)
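As a sketch of the rule being described (illustrative only, not Synapse's actual should_notify(); the helper name should_send_presence_update is invented, and it assumes UserPresenceState-like objects with the state, currently_active, last_active_ts and status_msg attributes that appear throughout this diff):

```python
def should_send_presence_update(prev_state, new_state) -> bool:
    # Any real state change (online/unavailable/busy/offline) always goes out.
    if new_state.state != prev_state.state:
        return True
    # Flipping currently_active either way matters: False -> True is the packet
    # that should carry the fresh last_active_ts; True -> False tells clients
    # to start trusting last_active_ago again.
    if new_state.currently_active != prev_state.currently_active:
        return True
    # While currently_active is True, clients must treat the user as active
    # regardless of last_active_ts, so a bare last_active_ts bump can be skipped.
    if new_state.currently_active:
        return new_state.status_msg != prev_state.status_msg
    return (
        new_state.last_active_ts != prev_state.last_active_ts
        or new_state.status_msg != prev_state.status_msg
    )
```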


def _set_presencestate_with_status_msg(
self, user_id: str, state: str, status_msg: Optional[str]
) -> None: