This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Performance of device list updates could be improved (aka /login is unusably slow) #7721

Closed
babolivier opened this issue Jun 18, 2020 · 12 comments
Labels: A-Performance (Performance, both client-facing and admin-facing) · T-Enhancement (New features, changes in functionality, improvements in performance, or user-facing enhancements) · Z-Help-Wanted (We know exactly how to fix this issue, and would be grateful for any contribution) · z-p2 (Deprecated Label)

Comments

@babolivier
Contributor

babolivier commented Jun 18, 2020

I was investigating why logging in with a new device was leading to poor performance on my account. It seems that both sending the device list update to other homeservers, and other homeservers in turn hitting GET /_matrix/federation/v1/user/devices/{userId}, are responsible.

Looking a bit further, it looks like both code paths end up calling _get_e2e_device_keys_txn in the database code, which seems to be pretty expensive, and is called multiple times in a row, yet isn't cached. This results in quite bad performance:

[screenshot: Grafana graphs of database transaction timings]

On the bottom-right graph, the _get_e2e_device_keys_txn transaction (in blue) originates from sending out the m.device_list_update EDUs, while the get_devices_with_keys_by_user one (in purple) originates from responding to GET /_matrix/federation/v1/user/devices/{userId}. Both transactions involve calling the _get_e2e_device_keys_txn function (in fact, the first one does only that).
On the bottom-left graph these two transactions are coloured in green and purple respectively.

@clokep clokep added z-p2 (Deprecated Label) A-Performance Performance, both client-facing and admin-facing labels Jun 18, 2020
@erikjohnston
Member

I think the next steps here are to add a cache 👍
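The suggested fix can be sketched as a memoising wrapper around the expensive per-user device-key lookup, with explicit invalidation when a user's devices change. This is purely illustrative: `DeviceKeyCache` and `fetch_e2e_device_keys` are invented names, not Synapse's actual caching API (which uses its own `@cached` descriptor machinery).

```python
# Hypothetical sketch: memoise the expensive per-user device-key
# lookup and invalidate on device changes. Names are illustrative,
# not Synapse's real API.

class DeviceKeyCache:
    def __init__(self, fetch_fn):
        self._fetch = fetch_fn          # the expensive DB query
        self._cache = {}                # user_id -> device keys

    def get(self, user_id):
        if user_id not in self._cache:
            self._cache[user_id] = self._fetch(user_id)
        return self._cache[user_id]

    def invalidate(self, user_id):
        # Call whenever the user adds or removes a device.
        self._cache.pop(user_id, None)


calls = []
def fetch_e2e_device_keys(user_id):
    calls.append(user_id)               # track actual DB hits
    return {"DEVICE1": {"key": "..."}}

cache = DeviceKeyCache(fetch_e2e_device_keys)
cache.get("@alice:example.com")
cache.get("@alice:example.com")         # second call served from cache
assert len(calls) == 1
cache.invalidate("@alice:example.com")
cache.get("@alice:example.com")         # refetched after invalidation
assert len(calls) == 2
```

The hard part in practice is invalidation: the cache must be cleared on every code path that mutates a user's devices, including replication from workers.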

@erikjohnston erikjohnston added the Z-Help-Wanted We know exactly how to fix this issue, and would be grateful for any contribution label Jun 25, 2020
@richvdh richvdh changed the title Performances for device list updates could be improved Performance of device list updates could be improved Jul 8, 2020
@hifi
Contributor

hifi commented Feb 21, 2022

In #12048 I had an issue where logging in with a new device pretty much kills Synapse for a while because of the device-list notify on login. Just batching up and slowing down sending the notifies would help a lot.
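The "batching up and slowing down" idea can be sketched as a queue that is drained in fixed-size batches on a timer, rather than sending one notify per destination immediately. `BatchedNotifier` is a hypothetical name; in Synapse the draining would run on a Twisted looping call so the reactor is never blocked by thousands of sends at once.

```python
# Illustrative sketch: accumulate device-list notifies and flush them
# in fixed-size batches instead of all at once on login.

from collections import deque

class BatchedNotifier:
    def __init__(self, send_fn, batch_size=20):
        self._send = send_fn
        self._batch_size = batch_size
        self._queue = deque()

    def notify(self, destination):
        # Cheap: just enqueue, return to the caller immediately.
        self._queue.append(destination)

    def flush_one_batch(self):
        # In Synapse this would run on a timer (e.g. a looping call),
        # spreading the work out over time.
        batch = [self._queue.popleft()
                 for _ in range(min(self._batch_size, len(self._queue)))]
        if batch:
            self._send(batch)
        return batch

sent = []
notifier = BatchedNotifier(sent.append, batch_size=2)
for host in ["a.org", "b.org", "c.org"]:
    notifier.notify(host)
notifier.flush_one_batch()   # sends ["a.org", "b.org"]
notifier.flush_one_batch()   # sends ["c.org"]
assert sent == [["a.org", "b.org"], ["c.org"]]
```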

@erikjohnston erikjohnston added the T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. label Feb 23, 2022
@richvdh
Member

richvdh commented Mar 2, 2022

It's worth mentioning that this effectively prevents logging in with Hydrogen (cf element-hq/hydrogen-web#699)

@richvdh
Member

richvdh commented Mar 2, 2022

oh, and fluffychat, which seems to do the same thing

@erikjohnston
Member

erikjohnston commented Mar 2, 2022

Note that there seem to be two related but distinct problems:

  1. _get_e2e_device_keys_txn is slow as mentioned in the issue description
  2. notify_device_updates is slow

@richvdh ooi what were you seeing? Or was it a combination of the two?

@richvdh
Member

richvdh commented Mar 2, 2022

I was seeing all requests to my HS responding with a 502, due to my reverse-proxy timing out. The mean reactor tick time is about 15 seconds, which means Synapse is essentially unable to respond to anything. I think it's very busy sending out federation updates.

@richvdh
Member

richvdh commented Mar 2, 2022

related: #5373

@hifi
Contributor

hifi commented Mar 2, 2022

@erikjohnston In #12048 I disabled notify_device_updates temporarily and I could login just fine without delays and Synapse was chugging along without stalling.

@erikjohnston
Member

erikjohnston commented Mar 4, 2022

#12132 will hopefully help a bit here, as it smears out the sending of device list updates. However, it won't help with the other issues that we're seeing.

My current thinking is that we should try and get rid of the use of store.get_users_who_share_room_with_user in notify_device_update, as that is expensive for larger accounts (both in CPU and memory terms). We can instead pull out the rooms for the user, which is enough to wake up the notifier. Then, instead of recording the device update per remote server, we store the device update per room [1]. When the federation sender pulls out the device updates it can find the hosts in the room.

This does lead to the same amount of work being done (as self.store.get_users_who_share_room_with_user effectively just does get_rooms_for_user followed by a loop of get_users_in_room), but has the advantage of:

  1. a lot of the process happens asynchronously, so we can return the /login (or whatever) request sooner; and
  2. the federation sender can prioritise which rooms and hosts to send out to first, e.g. it could handle rooms in order of when the last message was sent, so that we quickly get device updates to servers that we talk to frequently at the expense of servers that we haven't talked to for ages.

We can also do a vaguely similar thing for presence, i.e. get rid of store.get_users_who_share_room_with_user, which should help the perf overhead of presence.


[1] This can be done in a backwards-compatible way by recording in the DB both device list updates by server and by room at the same time, then in a future release stop writing the former and bump the minimum DB schema version.
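The per-room scheme above can be sketched as follows: on a device change we only record (room, user) pairs, and the federation sender expands rooms to hosts lazily, busiest rooms first. All names here are invented for illustration; this is not Synapse's real schema or API.

```python
# Sketch of the per-room idea: record only the user's rooms at change
# time (cheap), expand rooms to hosts later in the federation sender,
# prioritising recently-active rooms.

device_changes_by_room = []   # stands in for a DB table

def on_device_change(user_id, rooms_of_user):
    # No get_users_who_share_room_with_user: just the user's rooms.
    for room_id in rooms_of_user:
        device_changes_by_room.append((room_id, user_id))

def federation_send_plan(hosts_in_room, last_activity):
    # Expand rooms to hosts lazily, most recently active rooms first,
    # so busy servers get device updates quickly.
    rooms = sorted({r for r, _ in device_changes_by_room},
                   key=lambda r: last_activity[r], reverse=True)
    plan = []
    for room_id in rooms:
        for host in hosts_in_room[room_id]:
            plan.append((host, room_id))
    return plan

on_device_change("@alice:hs1", ["!busy:hs1", "!quiet:hs1"])
plan = federation_send_plan(
    {"!busy:hs1": ["hs2", "hs3"], "!quiet:hs1": ["hs4"]},
    {"!busy:hs1": 200, "!quiet:hs1": 10},
)
assert plan[0][1] == "!busy:hs1"   # busy room handled first
```

The total work is the same as before, but the expensive room-to-host expansion happens after /login has returned, and the sender controls the ordering.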

@ara4n
Member

ara4n commented Mar 19, 2022

I literally can't log in any more thanks to this. It would be very much appreciated if this could be prioritised...

@ara4n ara4n changed the title Performance of device list updates could be improved Performance of device list updates could be improved (aka /login is unusably slow) Mar 19, 2022
erikjohnston added a commit that referenced this issue Apr 4, 2022
This is a first step in dealing with #7721.

The idea is basically that rather than calculating the full set of users a device list update needs to be sent to up front, we instead simply record the rooms the user was in at the time of the change. This will allow a few things:

1. we can defer calculating the set of remote servers that need to be poked about the change; and
2. during `/sync` and `/keys/changes` we can also avoid calculating users who share rooms with other users, and instead just look at the rooms that have changed.

However, care needs to be taken to correctly handle server downgrades. As such this PR writes to both `device_lists_changes_in_room` and the `device_lists_outbound_pokes` table synchronously. In a future release we can then bump the database schema compat version to `69` and then we can assume that the new `device_lists_changes_in_room` exists and is handled.

There is a temporary option to disable writing to `device_lists_outbound_pokes` synchronously, allowing us to test the new code path does work (and by implication upgrading to a future release and downgrading to this one will work correctly).

Note: Ideally we'd do the calculation of room to servers on a worker (e.g. the background worker), but currently only master can write to the `device_list_outbound_pokes` table.
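The dual-write strategy from the commit message can be sketched like this: write both the old per-destination table and the new per-room table until the minimum schema version can be bumped. The table names match the commit message, but the code itself is an illustrative stand-in, not Synapse's storage layer.

```python
# Sketch of the backwards-compatible dual write: populate both the
# old per-destination table and the new per-room table, so an older
# release (which only reads the old table) still works after a
# downgrade.

WRITE_OLD_TABLE = True   # the temporary option mentioned above

device_lists_outbound_pokes = []     # old: one row per destination
device_lists_changes_in_room = []    # new: one row per room

def record_device_change(user_id, device_id, rooms, destinations):
    for room_id in rooms:
        device_lists_changes_in_room.append((user_id, device_id, room_id))
    if WRITE_OLD_TABLE:
        # Kept until the minimum schema version is bumped, so a
        # downgraded server still sees the outbound pokes.
        for dest in destinations:
            device_lists_outbound_pokes.append((user_id, device_id, dest))

record_device_change("@a:hs1", "DEV", ["!r1:hs1"], ["hs2", "hs3"])
assert len(device_lists_changes_in_room) == 1
assert len(device_lists_outbound_pokes) == 2
```

Setting `WRITE_OLD_TABLE = False` corresponds to the temporary option for testing that the new code path works on its own.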
@erikjohnston
Member

1.58.0rc1 should now include the fix described in #7721 (comment), so I think this is fixed from a user perspective. Logging in on a large account still takes a lot of CPU and DB though, so we may want to investigate that, but for now I think we can close this issue.

@chagai95
Contributor

chagai95 commented Jun 6, 2022

I still cannot log in with an account that's in many rooms. I have 8 GB RAM and 8 cores; is there anything in the logs that I can keep an eye out for? I tried playing around with get_users_who_share_room_with_user and get_users_in_room but had no success. Anything else I'm missing, perhaps?

7 participants