
Presence fix for idle mode #15989

Closed

Conversation

@realtyem (Contributor) commented Jul 24, 2023

Presence has an idle mode, called 'Unavailable', which currently doesn't work correctly and generates additional traffic. Make it behave.

This is what it looks like if you zoom in to a 1 second resolution.

presence_flicker_1

And it's repetitive. This is my client sitting open and largely untouched.

presence_flicker_frequency

But it doesn't ever actually idle. It goes back to online in nearly the same second. For testing this PR, and for all the following pictures, I parked myself in an empty room so the data would not be contaminated by read receipts, read markers, or typing notifications (and of course I didn't send myself any messages), as all of those bump last_active_ts and prevent idling. I also ramped up the scraping interval and changed Grafana's display points to use $__rate_interval so I could get 1-second resolution on the timing (which makes the scale look odd). The point was to see how often it was happening.

A client/user is considered idle after 5 minutes of no 'pro-active' events or activity. As you can see, it's not quite right.

The fix is three-fold:

  1. Fix the last_active_ts bug, so that a user no longer updates their last_active_ts just by syncing.
  2. Stop a race between TCP replication calls to USER_SYNC in which one would say we are 'online' and another would say we are 'offline', creating a flicker (@nico-famedly). This appeared to occur only in a worker-mode dance between a sync worker and a presence stream_writer, but it would still propagate over federation as it was a change of state. I want to thank @anoadragon453 for putting on an archaeology hat and providing the fix for this; I don't think I would have looked here without that.
  3. Add a condition that checks whether a PresenceState was 'unavailable' (which is what the spec calls idle) and verifies against last_active_ts whether the request to become 'online' is accurate. I want to thank @jevolk for letting me borrow his homeserver to see what was happening on the other side of federation. (A rough sketch of checks 1 and 3 follows this list.)
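A minimal sketch of the shape of checks 1 and 3, paraphrased rather than copied from the diff (the real hunks are quoted in the review comments below; PresenceState, IDLE_TIMER, new_fields and copy_and_replace() are existing Synapse names, but the surrounding function bodies are simplified here):

# 1. In set_state(): only treat this as a pro-active event when the user is
#    genuinely coming online from offline, so a plain /sync no longer bumps
#    last_active_ts.
if prev_state.state == PresenceState.OFFLINE:
    new_fields["last_active_ts"] = self.clock.time_msec()

# 3. In handle_update(): if the user had idled out but is still syncing, the
#    incoming state looks 'online'; trust last_active_ts instead and keep them
#    'unavailable' when no pro-active event has happened within IDLE_TIMER.
if prev_state.state == PresenceState.UNAVAILABLE:
    if now - new_state.last_active_ts > IDLE_TIMER:
        new_state = new_state.copy_and_replace(state=PresenceState.UNAVAILABLE)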
One without Two resulted in this bad dream (every 35 seconds):

Just look at the yellow one on the left; the orange one on the right we'll deal with in Three.

offline_flicker

I called it off after 30 minutes; that was enough of that.

offline_flicker_zoomed_out

One and Two without Three gave this nightmare (which didn't start for over 5 minutes, but then recurred every 5 seconds):

A close-up of these is shown above.

idle_flicker_fail

And I threw in some tests to check this behaviour. They pass now, but on develop you get:

===============================================================================
[FAIL]
Traceback (most recent call last):
  File "/home/jason/matrix-org/synapse/tests/handlers/test_presence.py", line 829, in test_syncing_appropriately_handles_last_active_ts
    self.assertEqual(prev_state.last_active_ts, state.last_active_ts)
  File "/home/jason/.cache/pypoetry/virtualenvs/matrix-synapse-8koxSNBY-py3.10/lib/python3.10/site-packages/twisted/trial/_synctest.py", line 441, in assertEqual
    super().assertEqual(first, second, msg)
  File "/usr/lib/python3.10/unittest/case.py", line 845, in assertEqual
    assertion_func(first, second, msg=msg)
  File "/usr/lib/python3.10/unittest/case.py", line 838, in _baseAssertEqual
    raise self.failureException(msg)
twisted.trial.unittest.FailTest: 2000 != 4000

tests.handlers.test_presence.PresenceHandlerTestCase.test_syncing_appropriately_handles_last_active_ts
===============================================================================
[FAIL]
Traceback (most recent call last):
  File "/home/jason/matrix-org/synapse/tests/handlers/test_presence.py", line 330, in test_idle_while_syncing_stays_idle
    self.assertFalse(persist_and_notify)
  File "/home/jason/.cache/pypoetry/virtualenvs/matrix-synapse-8koxSNBY-py3.10/lib/python3.10/site-packages/twisted/trial/_synctest.py", line 386, in assertFalse
    super().assertFalse(condition, msg)
  File "/usr/lib/python3.10/unittest/case.py", line 681, in assertFalse
    raise self.failureException(msg)
twisted.trial.unittest.FailTest: True is not false

tests.handlers.test_presence.PresenceUpdateTestCase.test_idle_while_syncing_stays_idle

(For those paying attention: the original version of this PR had the last_active_ts fix check against != PresenceState.ONLINE. That interfered with idle, so it was changed to exclude it.)

Hey look, it works now

online

online_3_min

idle_6_min

idle_33_min

Correct idling federation traffic:

  • Spike one is coming online
  • Spike two is currently_active going false
  • Spike three is going idle (unavailable)
  • Spike four is the federation ping

outgoing_edu_final

presence_behaving

Output provided by jevolk's server (does not match up with the metrics above, but was used to verify the unavailable state):
Sun, 23 Jul 2023 06:30:23 +0000 {"currently_active":true,"last_active_ago":464,"presence":"online","user_id":"@<me>"} $eWtdp-Km0YduwJzMdsYcAhsOC4BteBzBpNDbWNTELQQ
Sun, 23 Jul 2023 06:22:56 +0000 {"currently_active":false,"last_active_ago":305634,"presence":"unavailable","user_id":"@<me>"} $pogN1n4O2OHAEj5LIF_BHKd435zyFZIQNWSNhg9P0Ag
Sun, 23 Jul 2023 06:18:56 +0000 {"currently_active":false,"last_active_ago":65630,"presence":"online","user_id":"@<me>"} $fqAhopBuouGn4bnl6WgCk-HzMjrBEx4j4tpNAtP_Km0
Sun, 23 Jul 2023 06:17:45 +0000 {"currently_active":true,"last_active_ago":347,"presence":"online","user_id":"@<me>"} $W4LcbnApxYcPz9KWkwoomdUZfv3USqpc6VVlgY1AOhg
Sun, 23 Jul 2023 06:16:01 +0000 {"currently_active":false,"last_active_ago":303489,"presence":"unavailable","user_id":"@<me>"} $frgWplfuLdWx8rKxPlpyQ3OXTAMVJaaELNYlYvL1-ZA
Sun, 23 Jul 2023 06:12:01 +0000 {"currently_active":false,"last_active_ago":62724,"presence":"online","user_id":"@<me>"} $A0sT_8xg-UbVUEvi8Sk0e7YnlvmWubPwsO9E9VTkRlw
Sun, 23 Jul 2023 06:10:16 +0000 {"currently_active":true,"last_active_ago":168,"presence":"online","user_id":"@<me>"} $WCTjiYvvJMUC5prdZGP01STaSAEp8qrApxGZ9Udcxso
Sun, 23 Jul 2023 06:09:56 +0000 {"currently_active":false,"last_active_ago":63547,"presence":"online","user_id":"@<me>"} $lw3ujOfLF7cZtMhqAPse1mA9uB-ffevqqcln5JJqg-M
Sun, 23 Jul 2023 06:04:16 +0000 {"currently_active":true,"last_active_ago":235,"presence":"online","user_id":"@<me>"} $sFl_bBDjpr3ZyZyEz7wrjsr7N4iAcd8iny05oEKxxJE
Sun, 23 Jul 2023 06:03:31 +0000 {"currently_active":false,"last_active_ago":62517,"presence":"online","user_id":"@<me>"} $80BGZnbXSg6Tj_d-gAMz_uHp97XUFNXpsjprnjVmV8g
Sun, 23 Jul 2023 06:01:35 +0000 {"currently_active":true,"last_active_ago":126,"presence":"online","user_id":"@<me>"} $rENTSkRYg_HYwtJI0UK1ubjf4hJ0ckvFOLQPOJrOnWU
Sun, 23 Jul 2023 06:01:28 +0000 {"currently_active":false,"last_active_ago":62886,"presence":"online","user_id":"@<me>"} $NxA6uvC6ekLR1plHtjLC3GNbF5SPZ0wyDPsPfeiCbnw
Sun, 23 Jul 2023 05:58:07 +0000 {"currently_active":true,"last_active_ago":264,"presence":"online","user_id":"@<me>"} $n4ebVHKlrlPtFoofgISFmiUQ4NAbsOHqEZYNz7rzccI
Sun, 23 Jul 2023 05:50:58 +0000 {"currently_active":true,"last_active_ago":2761,"presence":"online","user_id":"@<me>"} $jRJWtnex4F-5dkVs9YO3VG4sHK6Myi2cBcgfXoaPAYg
Sun, 23 Jul 2023 05:41:11 +0000 {"currently_active":true,"last_active_ago":3188,"presence":"online","user_id":"@<me>"} $y3p-j2XIjoAtAONLo-MDKPsGB0dJe2cmOxA0AXFn2sY
Sun, 23 Jul 2023 05:14:49 +0000 {"currently_active":false,"last_active_ago":98635,"presence":"online","user_id":"@<me>"} $GbeWx4z3FhbkKQfkNS37sIhJvRLopGUCGKllPlPTOQk

There is more to do in this area, but it will be in a different PR as it may be somewhat contentious.

Should...

Pull Request Checklist

Signed-off-by: Jason Little realtyem@gmail.com

@realtyem realtyem marked this pull request as ready for review July 24, 2023 12:52
@realtyem realtyem requested a review from a team as a code owner July 24, 2023 12:52
@DoM1niC commented Jul 25, 2023

Can confirm, same problem here.

@clokep (Contributor) left a comment

I'm not quite sure I'm grokking this code, but I left a few questions that will hopefully clarify for myself or a future reviewer.

The tests do look pretty reasonable though, so I can believe that this is fixing bugs!

    presence == PresenceState.BUSY and self._busy_presence_enabled
):
if (
    prev_state.state == PresenceState.OFFLINE
Contributor

I'm failing to follow why this is specifically for offline -> online; naively I would have expected it to apply when going from !online -> online.

(The busy presence bit seems like it might have the same bug?)

Contributor Author

I had a reason, which is entirely escaping me right now (and appears to be missing from my notes). Let me test it out, because if this holds up we can remove the 'hacky' bit in handle_update(). I think it was related to not using set_presence on the /sync query, as it defaults to 'online'. However, that behavior has recently changed (at least for Element; other clients may not have). Give me some time to look into it again.

Contributor

OK! The various possible state transitions are rather fuzzy in my head, which is why I'm asking. 😄

Contributor Author

I went ahead and made those changes. No deal (at least as-is). It will need something else to handle the fact that some clients still set presence via the presence endpoint and not through /sync.

https://github.com/matrix-org/synapse/actions/runs/5745219286

presence_temp

I left my desktop client open (so it should have been syncing in the background, at least until it went to sleep). But then I checked in real quick with my Android mobile, and a few minutes later this nonsense started spamming federation.

# Need to remove the user_id from users_going_offline or the looping call will
# tell the presence writer that the user went offline.
self.users_going_offline.pop(user_id, None)
self.send_user_sync(user_id, True, self.clock.time_msec())
Contributor

I'm not quite following why we remove the if-statement here. Can you explain a bit more what this is doing?

@realtyem (Contributor Author) commented Aug 2, 2023

I mentioned that above, in item 2 of the three-fold fix.
Some context to keep in mind before I begin the long version:

  • This only applies on a worker-mode deployment of Synapse where there is (at the very least) either a sync worker or a presence writer; it does not apply to a monolith deployment.
  • Syncs can happen in rapid or slow succession, depending on how many other users the homeserver is aware of plus any room updates coming along.
  • There is a looping call every 10 seconds that will mark a user as 'offline' if they are in users_going_offline and at least 10 seconds have passed. These timers are not synchronized in any way, so a theoretical 19 seconds could elapse. The looping call uses the companion to send_user_sync(), called send_stop_syncing(), which makes an explicit call to the presence writer to mark the user 'offline'.
  • This is tied to the context manager that wraps the call to the sync handler here.

Now

  1. At the _end() of the sync context manager, it used to add user_id as a key to users_going_offline with a value of the current time in milliseconds.
  2. On the next sync, user_id is pop()ped off of users_going_offline and checked for None (or, more specifically, it checked that there was not a value), and only then would it send_user_sync(). This only happens on the very first sync of a recent-ish session; otherwise the value is not None and send_user_sync() is not called.
  3. So, we are mid-sync and the presence writer has not been informed we are still around. Finish the context manager, and repeat however many times it takes to get to about 10 seconds (specifically, until the next time the looping call comes around to call send_stop_syncing() and mark the user offline).
  4. 10 seconds have passed; the context manager is still doing its thing and knows the user is still online (obviously, they are still mid-sync after all). Over on the presence writer, however, nothing has indicated in the last 10 seconds that the user is still online: no further USER_SYNC has come over TCP replication, and remember that set_state() no longer bumps last_active_ts on a regular basis. The machinery kicks into gear to mark the user as 'offline', which then gets sent over replication back to the sync handler side.
  5. At this point, the presence writer thinks the user is 'offline' and the sync handler (well, its associated WorkerPresenceHandler, actually) thinks the user is 'online'. The call from the presence writer overwrites what the context manager thinks at this point. Still no update to last_active_ts, by the way.
  6. On the next sync, the context manager sees they were 'offline' last and puts the user back to 'online', beginning the cycle again. You can see this in the metrics under the spoiler arrow with the words "One without Two resulted in this bad dream (every 35 seconds)".

Now, you are probably thinking, "Well, let's just remove the not in the conditional to check for an actual time". I considered that, but then the presence writer doesn't get its first notification that the user is syncing, and that felt like undefined behavior. Honestly, it might work out (and it is a place I'm planning on revisiting in the future as a bit of a refactor; there are more things to touch here). I will test this if you wish; my instinct says that as long as the second sync of a session is within 10 seconds then everything will be fine.
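To make the change concrete, here is an approximate before/after of the mark_as_coming_online() path being discussed (the 'after' matches the hunk quoted above; the 'before' body is paraphrased from memory, so treat it as a sketch rather than the exact old code):

# Before (approximate): USER_SYNC was only sent when the user was *not* already
# in users_going_offline, i.e. effectively only on the first sync of a session.
# A long-lived session could therefore go 10+ seconds without any USER_SYNC and
# get marked offline by the presence writer's looping call.
def mark_as_coming_online(self, user_id: str) -> None:
    going_offline = self.users_going_offline.pop(user_id, None)
    if not going_offline:
        self.send_user_sync(user_id, True, self.clock.time_msec())

# After (this PR): always tell the presence writer the user is syncing, while
# still popping the entry so the looping call doesn't report them as offline.
def mark_as_coming_online(self, user_id: str) -> None:
    self.users_going_offline.pop(user_id, None)
    self.send_user_sync(user_id, True, self.clock.time_msec())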

And, to be fair, fully fixing everything possible wasn't the scope of this PR. I wanted 'unavailable' to work as advertised. But this was affecting the behavior in a negative way and wasn't going to be allowed to get away: a traditional 'bug' that was fully exposed by last_active_ts no longer being updated on every single sync.

Comment on lines +1985 to +1992
if prev_state.state == PresenceState.UNAVAILABLE:
    # If we had idled out, but are still syncing, then new state will look
    # like it's online. Verify with last_active_ts, as that is a sign of a
    # pro-active event. Override that we are online when we probably aren't.
    if now - new_state.last_active_ts > IDLE_TIMER:
        new_state = new_state.copy_and_replace(
            state=PresenceState.UNAVAILABLE
        )
Contributor

This feels a bit hacky -- can we avoid trying to update in this case at all, instead of backtracking from saying we're online to unavailable?

@clokep clokep requested a review from a team August 1, 2023 14:58
@erikjohnston (Member)

@realtyem FYI it sounds like there has been a regression in Element Web which might be a root cause for us seeing increasing problems around this area. I'm not sure how much that will affect this PR

@realtyem (Contributor Author) commented Aug 2, 2023

@realtyem FYI it sounds like there has been a regression in Element Web which might be a root cause for us seeing increasing problems around this area. I'm not sure how much that will affect this PR

I'm aware. I'm still running some tests (they take a while to accumulate the data) to make sure this all holds true; I believe it does. It should still be OK, though I am currently sidetracked by checking @clokep's question about whether the mentioned condition should be more encompassing. I should know for sure in about 24 hours (give or take).

…st_active_ts is updated if we were anything but already ONLINE
… that last_active_ts is updated if we were anything but already ONLINE"

This reverts commit 6b7e273.
Comment on lines +830 to +831
# last_active_ts should not have changed as no pro-active event has occurred
self.assertEqual(prev_state.last_active_ts, state.last_active_ts)
Contributor

@erikjohnston and I went through this and we think there's a misunderstanding here -- last_active_ts should be updated for each sync. It seems like not updating this except for the initial offline -> online transition is fixing the bug, at the expense of never updating last_active_ts.

Member

Reading just that summary, that sounds like a recipe for disaster.

start: /sync (online) => update last_active_ts => new data to send to syncer => goto start

@clokep (Contributor) commented Aug 4, 2023

start: /sync (online) => update last_active_ts => new data to send to syncer => goto start

My understanding is that updating the last_active_ts should not cause new data to be sent down sync while you're actively syncing.

Member

Ah yes

https://spec.matrix.org/v1.7/client-server-api/#last-active-ago

To reduce the number of presence updates sent to clients the server may include a currently_active boolean field when the presence state is online. When true, the server will not send further updates to the last active time until an update is sent to the client with either a) currently_active set to false or b) a presence state other than online. During this period clients must consider the user to be currently active, irrespective of the last active time.

Contributor Author

To be clear

There is more to do in this area, but will be on a different PR as it may be somewhat contentious.

I wasn't done in this area. This doesn't completely fix last_active_ts being sent inappropriately, just the part that was affecting idle mode (I'm trying to stay inside the scope of the PR; the TCP replication race was one of the things masking this and keeping it from working, or it would also have been separate).

Fixing last_active_ts being updated inappropriately really only means that if currently_active is True we don't have to send anything when the only thing to send is an update to last_active_ts. The only time last_active_ts should be sent is when currently_active is False, and that update should flip currently_active to True in the same packet.
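As a hypothetical sketch of that future rule (not code in this PR; should_send_last_active_update() is a made-up helper, while state, last_active_ts and currently_active are real UserPresenceState fields):

# Hypothetical helper, not part of this PR: an update whose only change is
# last_active_ts is worth federating only when it is also the update that
# flips currently_active from False to True.
def should_send_last_active_update(prev_state, new_state) -> bool:
    only_last_active_changed = (
        new_state.state == prev_state.state
        and new_state.last_active_ts != prev_state.last_active_ts
    )
    if not only_last_active_changed:
        # Some other field changed; the normal rules apply.
        return True
    return not prev_state.currently_active and new_state.currently_active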

State changes are separate from this, and this PR is only supposed to be about making the state change work. I'm sorry this is causing problems. I can close this and separate it all out if you would rather. (But I'm not updating the pictures; those were a pain to do the first time.)

@realtyem (Contributor Author) commented Aug 8, 2023

I'm just going to break this into pieces. Maybe that will be easier to deal with.

@realtyem realtyem closed this Aug 8, 2023
Development

Successfully merging this pull request may close these issues.

last_active_ago is updated on sync when it shouldn't be
5 participants