
Presence fix for idle mode #15989

Closed

Conversation

@realtyem (Contributor) commented Jul 24, 2023

Presence has an idle mode, called 'Unavailable', which currently doesn't work correctly and generates additional traffic. Make it behave.

This is what it looks like if you zoom in to a 1 second resolution.

presence_flicker_1

And it's repetitive. This is my client sitting open and largely untouched.

presence_flicker_frequency

But it doesn't ever actually idle. It goes back to online in nearly the same second. For testing this PR, and for all the following pictures, I parked myself in an empty room so the data would not be contaminated by read receipts, read markers, or typing notifications (and of course I didn't send myself any messages), as all of those bump last_active_ts and prevent idling. I also ramped up the scraping interval and changed Grafana's display points to use $__rate_interval so I could get 1-second resolution on the timing (which makes the scale look odd). The point was to see how often it was happening.

A client/user is considered idle after 5 minutes of no 'pro-active' events or activity. As you can see, it's not quite right.

The fix is three-fold:

  1. Fix the last_active_ts bug, so that a user no longer updates their last_active_ts just by syncing.
  2. Stop a race between TCP replication calls to USER_SYNC in which one would say we are 'online' and another would say we are 'offline', creating a flicker (@nico-famedly). This appeared to occur only in a worker-mode dance between a sync worker and a presence stream_writer, but it would still propagate over federation as it was a change of state. I want to thank @anoadragon453 for putting on an archaeology hat and providing the fix for this; I don't think I would have looked here without that.
  3. Add a condition that checks whether a PresenceState was 'unavailable' (which is what the spec calls idle) and verifies against last_active_ts whether the request to become 'online' is accurate. I want to thank @jevolk for letting me borrow his homeserver to see what was happening on the other side of federation. (A rough sketch of checks 1 and 3 follows this list.)
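A minimal sketch of the shape of checks 1 and 3, paraphrased rather than copied from the diff (the real hunks are quoted in the review comments below; PresenceState, IDLE_TIMER, new_fields and copy_and_replace() are existing Synapse names, but the surrounding function bodies are simplified here):

# 1. In set_state(): only treat this as a pro-active event when the user is
#    genuinely coming online from offline, so a plain /sync no longer bumps
#    last_active_ts.
if prev_state.state == PresenceState.OFFLINE:
    new_fields["last_active_ts"] = self.clock.time_msec()

# 3. In handle_update(): if the user had idled out but is still syncing, the
#    incoming state looks 'online'; trust last_active_ts instead and keep them
#    'unavailable' when no pro-active event has happened within IDLE_TIMER.
if prev_state.state == PresenceState.UNAVAILABLE:
    if now - new_state.last_active_ts > IDLE_TIMER:
        new_state = new_state.copy_and_replace(state=PresenceState.UNAVAILABLE)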
One without Two resulted in this bad dream (every 35 seconds):

Just look at the yellow one on the left; the orange one on the right we'll deal with in Three.

offline_flicker

I called it off after 30 minutes; that was enough of that.

offline_flicker_zoomed_out

One and Two without Three gave this nightmare (which didn't start for over 5 minutes, but then recurred every 5 seconds):

A close-up of these is shown above.

idle_flicker_fail

And I threw in some tests to check this behaviour. They pass now, but on develop you get:

===============================================================================
[FAIL]
Traceback (most recent call last):
  File "/home/jason/matrix-org/synapse/tests/handlers/test_presence.py", line 829, in test_syncing_appropriately_handles_last_active_ts
    self.assertEqual(prev_state.last_active_ts, state.last_active_ts)
  File "/home/jason/.cache/pypoetry/virtualenvs/matrix-synapse-8koxSNBY-py3.10/lib/python3.10/site-packages/twisted/trial/_synctest.py", line 441, in assertEqual
    super().assertEqual(first, second, msg)
  File "/usr/lib/python3.10/unittest/case.py", line 845, in assertEqual
    assertion_func(first, second, msg=msg)
  File "/usr/lib/python3.10/unittest/case.py", line 838, in _baseAssertEqual
    raise self.failureException(msg)
twisted.trial.unittest.FailTest: 2000 != 4000

tests.handlers.test_presence.PresenceHandlerTestCase.test_syncing_appropriately_handles_last_active_ts
===============================================================================
[FAIL]
Traceback (most recent call last):
  File "/home/jason/matrix-org/synapse/tests/handlers/test_presence.py", line 330, in test_idle_while_syncing_stays_idle
    self.assertFalse(persist_and_notify)
  File "/home/jason/.cache/pypoetry/virtualenvs/matrix-synapse-8koxSNBY-py3.10/lib/python3.10/site-packages/twisted/trial/_synctest.py", line 386, in assertFalse
    super().assertFalse(condition, msg)
  File "/usr/lib/python3.10/unittest/case.py", line 681, in assertFalse
    raise self.failureException(msg)
twisted.trial.unittest.FailTest: True is not false

tests.handlers.test_presence.PresenceUpdateTestCase.test_idle_while_syncing_stays_idle

(For those paying attention: the original version of this PR had the last_active_ts fix check against != PresenceState.ONLINE. That interfered with idle, so it was changed to exclude it.)

Hey look, it works now

online

online_3_min

idle_6_min

idle_33_min

Correct idling federation traffic:

  • Spike one is coming online
  • Spike two is currently_active going false
  • Spike three is going idle (unavailable)
  • Spike four is the federation ping

outgoing_edu_final

presence_behaving

Output provided by jevolk's server (does not match up with the metrics above, but was used to verify the unavailable state):
Sun, 23 Jul 2023 06:30:23 +0000 {"currently_active":true,"last_active_ago":464,"presence":"online","user_id":"@<me>"} $eWtdp-Km0YduwJzMdsYcAhsOC4BteBzBpNDbWNTELQQ
Sun, 23 Jul 2023 06:22:56 +0000 {"currently_active":false,"last_active_ago":305634,"presence":"unavailable","user_id":"@<me>"} $pogN1n4O2OHAEj5LIF_BHKd435zyFZIQNWSNhg9P0Ag
Sun, 23 Jul 2023 06:18:56 +0000 {"currently_active":false,"last_active_ago":65630,"presence":"online","user_id":"@<me>"} $fqAhopBuouGn4bnl6WgCk-HzMjrBEx4j4tpNAtP_Km0
Sun, 23 Jul 2023 06:17:45 +0000 {"currently_active":true,"last_active_ago":347,"presence":"online","user_id":"@<me>"} $W4LcbnApxYcPz9KWkwoomdUZfv3USqpc6VVlgY1AOhg
Sun, 23 Jul 2023 06:16:01 +0000 {"currently_active":false,"last_active_ago":303489,"presence":"unavailable","user_id":"@<me>"} $frgWplfuLdWx8rKxPlpyQ3OXTAMVJaaELNYlYvL1-ZA
Sun, 23 Jul 2023 06:12:01 +0000 {"currently_active":false,"last_active_ago":62724,"presence":"online","user_id":"@<me>"} $A0sT_8xg-UbVUEvi8Sk0e7YnlvmWubPwsO9E9VTkRlw
Sun, 23 Jul 2023 06:10:16 +0000 {"currently_active":true,"last_active_ago":168,"presence":"online","user_id":"@<me>"} $WCTjiYvvJMUC5prdZGP01STaSAEp8qrApxGZ9Udcxso
Sun, 23 Jul 2023 06:09:56 +0000 {"currently_active":false,"last_active_ago":63547,"presence":"online","user_id":"@<me>"} $lw3ujOfLF7cZtMhqAPse1mA9uB-ffevqqcln5JJqg-M
Sun, 23 Jul 2023 06:04:16 +0000 {"currently_active":true,"last_active_ago":235,"presence":"online","user_id":"@<me>"} $sFl_bBDjpr3ZyZyEz7wrjsr7N4iAcd8iny05oEKxxJE
Sun, 23 Jul 2023 06:03:31 +0000 {"currently_active":false,"last_active_ago":62517,"presence":"online","user_id":"@<me>"} $80BGZnbXSg6Tj_d-gAMz_uHp97XUFNXpsjprnjVmV8g
Sun, 23 Jul 2023 06:01:35 +0000 {"currently_active":true,"last_active_ago":126,"presence":"online","user_id":"@<me>"} $rENTSkRYg_HYwtJI0UK1ubjf4hJ0ckvFOLQPOJrOnWU
Sun, 23 Jul 2023 06:01:28 +0000 {"currently_active":false,"last_active_ago":62886,"presence":"online","user_id":"@<me>"} $NxA6uvC6ekLR1plHtjLC3GNbF5SPZ0wyDPsPfeiCbnw
Sun, 23 Jul 2023 05:58:07 +0000 {"currently_active":true,"last_active_ago":264,"presence":"online","user_id":"@<me>"} $n4ebVHKlrlPtFoofgISFmiUQ4NAbsOHqEZYNz7rzccI
Sun, 23 Jul 2023 05:50:58 +0000 {"currently_active":true,"last_active_ago":2761,"presence":"online","user_id":"@<me>"} $jRJWtnex4F-5dkVs9YO3VG4sHK6Myi2cBcgfXoaPAYg
Sun, 23 Jul 2023 05:41:11 +0000 {"currently_active":true,"last_active_ago":3188,"presence":"online","user_id":"@<me>"} $y3p-j2XIjoAtAONLo-MDKPsGB0dJe2cmOxA0AXFn2sY
Sun, 23 Jul 2023 05:14:49 +0000 {"currently_active":false,"last_active_ago":98635,"presence":"online","user_id":"@<me>"} $GbeWx4z3FhbkKQfkNS37sIhJvRLopGUCGKllPlPTOQk

There is more to do in this area, but it will be in a different PR as it may be somewhat contentious.

Should...

Pull Request Checklist

Signed-off-by: Jason Little realtyem@gmail.com

@realtyem realtyem marked this pull request as ready for review July 24, 2023 12:52
@realtyem realtyem requested a review from a team as a code owner July 24, 2023 12:52
@DoM1niC commented Jul 25, 2023

Can confirm, same problem here.

@clokep (Contributor) left a comment

I'm not quite sure I'm grokking this code, but I left a few questions that will hopefully clarify for myself or a future reviewer.

The tests do look pretty reasonable though, so I can believe that this is fixing bugs!

    presence == PresenceState.BUSY and self._busy_presence_enabled
):
if (
    prev_state.state == PresenceState.OFFLINE
Contributor

I'm failing to follow why this is specifically for offline -> online; naively I would have expected it to apply when going from !online -> online.

(The busy presence bit seems like it might have the same bug?)

Contributor Author

I had a reason, which is entirely escaping me right now (and appears to be missing from my notes). Let me test it out, because if this holds up we can remove the 'hacky' bit in handle_update(). I think it was related to not using set_presence on the /sync query, as it defaults to 'online'. However, that behavior has recently changed (at least for Element; other clients may not have). Give me some time to look into it again.

Contributor

OK! The various possible state transitions are rather fuzzy in my head, which is why I'm asking. 😄

Contributor Author

I went ahead and made those changes. No deal (at least as-is). It will need something else to handle the fact that some clients still set presence via the presence endpoint and not through /sync.

https://github.com/matrix-org/synapse/actions/runs/5745219286

presence_temp

I left my desktop client open (so it should have been syncing in the background, at least until it went to sleep). But then I checked in real quick with my Android mobile, and a few minutes later this nonsense started spamming federation.

# Need to remove the user_id from users_going_offline or the looping call will
# tell the presence writer that the user went offline.
self.users_going_offline.pop(user_id, None)
self.send_user_sync(user_id, True, self.clock.time_msec())
Contributor

I'm not quite following why we remove the if-statement here. Can you explain a bit more what this is doing?

@realtyem (Contributor Author) commented Aug 2, 2023

I mentioned that above, in item 2 of the three-fold fix.
Some context to keep in mind before I begin the long version:

  • This only applies on a worker-mode deployment of Synapse where there is (at the very least) either a sync worker or a presence writer; it does not apply to a monolith deployment.
  • Syncs can happen in rapid or slow succession, depending on how many other users the homeserver is aware of plus any room updates coming along.
  • There is a looping call every 10 seconds that will mark a user as 'offline' if they are in users_going_offline and at least 10 seconds have passed. These timers are not synchronized in any way, so a theoretical 19 seconds could elapse. The looping call uses the companion to send_user_sync(), called send_stop_syncing(), which makes an explicit call to the presence writer to mark the user 'offline'.
  • This is tied to the context manager that wraps the call to the sync handler here.

Now

  1. At the _end() of the sync context manager, it used to add user_id as a key to users_going_offline with a value of the current time in milliseconds.
  2. On the next sync, user_id is pop()ped off of users_going_offline and checked for None (or, more specifically, it checked that there was not a value), and only then would it send_user_sync(). This only happens on the very first sync of a recent-ish session; otherwise the value is not None and send_user_sync() is not called.
  3. So, we are mid-sync and the presence writer has not been informed we are still around. Finish the context manager, and repeat however many times it takes to get to about 10 seconds (specifically, until the next time the looping call comes around to call send_stop_syncing() and mark the user offline).
  4. 10 seconds have passed; the context manager is still doing its thing and knows the user is still online (obviously, they are still mid-sync after all). Over on the presence writer, however, nothing has indicated in the last 10 seconds that the user is still online: no further USER_SYNC has come over TCP replication, and remember that set_state() no longer bumps last_active_ts on a regular basis. The machinery kicks into gear to mark the user as 'offline', which then gets sent over replication back to the sync handler side.
  5. At this point, the presence writer thinks the user is 'offline' and the sync handler (well, its associated WorkerPresenceHandler, actually) thinks the user is 'online'. The call from the presence writer overwrites what the context manager thinks at this point. Still no update to last_active_ts, by the way.
  6. On the next sync, the context manager sees they were 'offline' last and puts the user back to 'online', beginning the cycle again. You can see this in the metrics under the spoiler arrow with the words "One without Two resulted in this bad dream (every 35 seconds)".

Now, you are probably thinking, "Well, let's just remove the not in the conditional to check for an actual time". I considered that, but then the presence writer doesn't get its first notification that the user is syncing, and that felt like undefined behavior. Honestly, it might work out (and it is a place I'm planning on revisiting in the future as a bit of a refactor; there are more things to touch here). I will test this if you wish; my instinct says that as long as the second sync of a session is within 10 seconds then everything will be fine.
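To make the change concrete, here is an approximate before/after of the mark_as_coming_online() path being discussed (the 'after' matches the hunk quoted above; the 'before' body is paraphrased from memory, so treat it as a sketch rather than the exact old code):

# Before (approximate): USER_SYNC was only sent when the user was *not* already
# in users_going_offline, i.e. effectively only on the first sync of a session.
# A long-lived session could therefore go 10+ seconds without any USER_SYNC and
# get marked offline by the presence writer's looping call.
def mark_as_coming_online(self, user_id: str) -> None:
    going_offline = self.users_going_offline.pop(user_id, None)
    if not going_offline:
        self.send_user_sync(user_id, True, self.clock.time_msec())

# After (this PR): always tell the presence writer the user is syncing, while
# still popping the entry so the looping call doesn't report them as offline.
def mark_as_coming_online(self, user_id: str) -> None:
    self.users_going_offline.pop(user_id, None)
    self.send_user_sync(user_id, True, self.clock.time_msec())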

And, to be fair, fully fixing everything possible wasn't the scope of this PR. I wanted 'unavailable' to work as advertised. But this was affecting the behavior in a negative way and wasn't going to be allowed to get away: a traditional 'bug' that was fully exposed by last_active_ts no longer being updated on every single sync.

Comment on lines +1985 to +1992
if prev_state.state == PresenceState.UNAVAILABLE:
    # If we had idled out, but are still syncing, then new state will look
    # like it's online. Verify with last_active_ts, as that is a sign of a
    # pro-active event. Override that we are online when we probably aren't.
    if now - new_state.last_active_ts > IDLE_TIMER:
        new_state = new_state.copy_and_replace(
            state=PresenceState.UNAVAILABLE
        )
Contributor

This feels a bit hacky -- can we avoid trying to update in this case at all, instead of backtracking from saying we're online to unavailable?

@clokep clokep requested a review from a team August 1, 2023 14:58
@erikjohnston (Member)

@realtyem FYI it sounds like there has been a regression in Element Web which might be a root cause for us seeing increasing problems around this area. I'm not sure how much that will affect this PR

@realtyem (Contributor Author) commented Aug 2, 2023

@realtyem FYI it sounds like there has been a regression in Element Web which might be a root cause for us seeing increasing problems around this area. I'm not sure how much that will affect this PR

I'm aware. I'm still running some tests (they take a while to accumulate the data) to make sure this all holds true; I believe it does. It should still be OK, though I am currently sidetracked by checking @clokep's question about whether the mentioned condition should be more encompassing. I should know for sure in about 24 hours (give or take).

…st_active_ts is updated if we were anything but already ONLINE
… that last_active_ts is updated if we were anything but already ONLINE"

This reverts commit 6b7e273.
Comment on lines +830 to +831
# last_active_ts should not have changed as no pro-active event has occurred
self.assertEqual(prev_state.last_active_ts, state.last_active_ts)
Contributor

@erikjohnston and I went through this and we think there's a misunderstanding here -- last_active_ts should be updated for each sync. It seems like not updating this except for the initial offline -> online transition is fixing the bug, at the expense of never updating last_active_ts.

Member

Reading just that summary, that sounds like a recipe for disaster.

start: /sync (online) => update last_active_ts => new data to send to syncer => goto start

@clokep (Contributor) commented Aug 4, 2023

start: /sync (online) => update last_active_ts => new data to send to syncer => goto start

My understanding is that updating the last_active_ts should not cause new data to be sent down sync while you're actively syncing.

Member

Ah yes

https://spec.matrix.org/v1.7/client-server-api/#last-active-ago

To reduce the number of presence updates sent to clients the server may include a currently_active boolean field when the presence state is online. When true, the server will not send further updates to the last active time until an update is sent to the client with either a) currently_active set to false or b) a presence state other than online. During this period clients must consider the user to be currently active, irrespective of the last active time.

Contributor Author

To be clear

There is more to do in this area, but will be on a different PR as it may be somewhat contentious.

I wasn't done in this area. This doesn't completely fix last_active_ts being sent inappropriately, just the part that was affecting idle mode (I'm trying to stay inside the scope of the PR; the TCP replication race was one of the things masking this and keeping it from working, or it would also have been separate).

Fixing last_active_ts being updated inappropriately really only means that if currently_active is True we don't have to send anything when the only thing to send is an update to last_active_ts. The only time last_active_ts should be sent is when currently_active is False, and that update should flip currently_active to True in the same packet.
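As a hypothetical sketch of that future rule (not code in this PR; should_send_last_active_update() is a made-up helper, while state, last_active_ts and currently_active are real UserPresenceState fields):

# Hypothetical helper, not part of this PR: an update whose only change is
# last_active_ts is worth federating only when it is also the update that
# flips currently_active from False to True.
def should_send_last_active_update(prev_state, new_state) -> bool:
    only_last_active_changed = (
        new_state.state == prev_state.state
        and new_state.last_active_ts != prev_state.last_active_ts
    )
    if not only_last_active_changed:
        # Some other field changed; the normal rules apply.
        return True
    return not prev_state.currently_active and new_state.currently_active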

State changes are separate from this, and this PR is only supposed to be about making the state change work. I'm sorry this is causing problems. I can close this and separate it all out if you would rather. (But I'm not updating the pictures; those were a pain to do the first time.)

@realtyem (Contributor Author) commented Aug 8, 2023

I'm just going to break this into pieces. Maybe that will be easier to deal with.

@realtyem realtyem closed this Aug 8, 2023
Development

Successfully merging this pull request may close these issues.

last_active_ago is updated on sync when it shouldn't be
5 participants