This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Performance of device list updates could be improved (aka /login is unusably slow) #7721

Closed
babolivier opened this issue Jun 18, 2020 · 12 comments
Labels: A-Performance (Performance, both client-facing and admin-facing) · T-Enhancement (New features, changes in functionality, improvements in performance, or user-facing enhancements) · Z-Help-Wanted (We know exactly how to fix this issue, and would be grateful for any contribution) · z-p2 (Deprecated Label)

Comments

@babolivier
Contributor

babolivier commented Jun 18, 2020

I was investigating why logging in with a new device was leading to poor performance on my account. It seems that both sending the device list update to other homeservers, and other homeservers in turn hitting GET /_matrix/federation/v1/user/devices/{userId}, are responsible.

Looking a bit further, it looks like both code paths end up calling _get_e2e_device_keys_txn in the database code, which seems to be pretty expensive, and is called multiple times in a row, yet isn't cached. This results in quite bad performance:

[screenshot: Grafana graphs of database transaction timings]

On the bottom-right graph, the _get_e2e_device_keys_txn transaction (in blue) originates from sending out the m.device_list_update EDUs, while the get_devices_with_keys_by_user one (in purple) originates from responding to GET /_matrix/federation/v1/user/devices/{userId}. Both transactions involve calling the _get_e2e_device_keys_txn function (in fact, the first one does only that).
On the bottom-left graph these two transactions are coloured in green and purple respectively.

@clokep clokep added z-p2 (Deprecated Label) A-Performance Performance, both client-facing and admin-facing labels Jun 18, 2020
@erikjohnston
Member

I think the next steps here are to add a cache 👍
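The suggested fix can be sketched as a memoising wrapper around the expensive per-user device-key lookup, with explicit invalidation when a user's devices change. This is purely illustrative: `DeviceKeyCache` and `fetch_e2e_device_keys` are invented names, not Synapse's actual caching API (which uses its own `@cached` descriptor machinery).

```python
# Hypothetical sketch: memoise the expensive per-user device-key
# lookup and invalidate on device changes. Names are illustrative,
# not Synapse's real API.

class DeviceKeyCache:
    def __init__(self, fetch_fn):
        self._fetch = fetch_fn          # the expensive DB query
        self._cache = {}                # user_id -> device keys

    def get(self, user_id):
        if user_id not in self._cache:
            self._cache[user_id] = self._fetch(user_id)
        return self._cache[user_id]

    def invalidate(self, user_id):
        # Call whenever the user adds or removes a device.
        self._cache.pop(user_id, None)


calls = []
def fetch_e2e_device_keys(user_id):
    calls.append(user_id)               # track actual DB hits
    return {"DEVICE1": {"key": "..."}}

cache = DeviceKeyCache(fetch_e2e_device_keys)
cache.get("@alice:example.com")
cache.get("@alice:example.com")         # second call served from cache
assert len(calls) == 1
cache.invalidate("@alice:example.com")
cache.get("@alice:example.com")         # refetched after invalidation
assert len(calls) == 2
```

The hard part in practice is invalidation: the cache must be cleared on every code path that mutates a user's devices, including replication from workers.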

@erikjohnston erikjohnston added the Z-Help-Wanted We know exactly how to fix this issue, and would be grateful for any contribution label Jun 25, 2020
@richvdh richvdh changed the title Performances for device list updates could be improved Performance of device list updates could be improved Jul 8, 2020
@hifi
Contributor

hifi commented Feb 21, 2022

In #12048 I had an issue where logging in with a new device pretty much kills Synapse for a while because of the device-list notify on login. Just batching up and slowing down sending the notifies would help a lot.
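The "batching up and slowing down" idea can be sketched as a queue that is drained in fixed-size batches on a timer, rather than sending one notify per destination immediately. `BatchedNotifier` is a hypothetical name; in Synapse the draining would run on a Twisted looping call so the reactor is never blocked by thousands of sends at once.

```python
# Illustrative sketch: accumulate device-list notifies and flush them
# in fixed-size batches instead of all at once on login.

from collections import deque

class BatchedNotifier:
    def __init__(self, send_fn, batch_size=20):
        self._send = send_fn
        self._batch_size = batch_size
        self._queue = deque()

    def notify(self, destination):
        # Cheap: just enqueue, return to the caller immediately.
        self._queue.append(destination)

    def flush_one_batch(self):
        # In Synapse this would run on a timer (e.g. a looping call),
        # spreading the work out over time.
        batch = [self._queue.popleft()
                 for _ in range(min(self._batch_size, len(self._queue)))]
        if batch:
            self._send(batch)
        return batch

sent = []
notifier = BatchedNotifier(sent.append, batch_size=2)
for host in ["a.org", "b.org", "c.org"]:
    notifier.notify(host)
notifier.flush_one_batch()   # sends ["a.org", "b.org"]
notifier.flush_one_batch()   # sends ["c.org"]
assert sent == [["a.org", "b.org"], ["c.org"]]
```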

@erikjohnston erikjohnston added the T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. label Feb 23, 2022
@richvdh
Member

richvdh commented Mar 2, 2022

It's worth mentioning that this effectively prevents logging in with Hydrogen (cf element-hq/hydrogen-web#699)

@richvdh
Member

richvdh commented Mar 2, 2022

oh, and fluffychat, which seems to do the same thing

@erikjohnston
Member

erikjohnston commented Mar 2, 2022

Note that there seem to be two related but distinct problems:

  1. _get_e2e_device_keys_txn is slow as mentioned in the issue description
  2. notify_device_updates is slow

@richvdh ooi what were you seeing? Or was it a combination of the two?

@richvdh
Member

richvdh commented Mar 2, 2022

I was seeing all requests to my HS responding with a 502, due to my reverse-proxy timing out. The mean reactor tick time is about 15 seconds, which means Synapse is essentially unable to respond to anything. I think it's very busy sending out federation updates.

@richvdh
Member

richvdh commented Mar 2, 2022

related: #5373

@hifi
Contributor

hifi commented Mar 2, 2022

@erikjohnston In #12048 I disabled notify_device_updates temporarily and I could login just fine without delays and Synapse was chugging along without stalling.

@erikjohnston
Member

erikjohnston commented Mar 4, 2022

#12132 will hopefully help a bit here, as it smears out the sending of device list updates. However, it won't help with the other issues that we're seeing.

My current thinking is that we should try and get rid of the use of store.get_users_who_share_room_with_user in notify_device_update, as that is expensive for larger accounts (both in CPU and memory terms). We can instead pull out the rooms for the user, which is enough to wake up the notifier. Then, instead of recording the device update per remote server, we store the device update per room [1]. When the federation sender pulls out the device updates it can find the hosts in the room.

This does lead to the same amount of work being done (as self.store.get_users_who_share_room_with_user effectively just does get_rooms_for_user followed by a loop of get_users_in_room), but has the advantage of:

  1. a lot of the process happens asynchronously, so we can return the /login (or whatever) request sooner; and
  2. the federation sender can prioritise which rooms and hosts to send out to first, e.g. it could handle rooms in order of when the last message was sent, so that we quickly get device updates to servers that we talk to frequently at the expense of servers that we haven't talked to for ages.

We can also do a vaguely similar thing for presence, i.e. get rid of store.get_users_who_share_room_with_user, which should help the perf overhead of presence.


[1] This can be done in a backwards-compatible way by recording in the DB both device list updates by server and by room at the same time, then in a future release stop writing the former and bump the minimum DB schema version.
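The per-room scheme above can be sketched as follows: on a device change we only record (room, user) pairs, and the federation sender expands rooms to hosts lazily, busiest rooms first. All names here are invented for illustration; this is not Synapse's real schema or API.

```python
# Sketch of the per-room idea: record only the user's rooms at change
# time (cheap), expand rooms to hosts later in the federation sender,
# prioritising recently-active rooms.

device_changes_by_room = []   # stands in for a DB table

def on_device_change(user_id, rooms_of_user):
    # No get_users_who_share_room_with_user: just the user's rooms.
    for room_id in rooms_of_user:
        device_changes_by_room.append((room_id, user_id))

def federation_send_plan(hosts_in_room, last_activity):
    # Expand rooms to hosts lazily, most recently active rooms first,
    # so busy servers get device updates quickly.
    rooms = sorted({r for r, _ in device_changes_by_room},
                   key=lambda r: last_activity[r], reverse=True)
    plan = []
    for room_id in rooms:
        for host in hosts_in_room[room_id]:
            plan.append((host, room_id))
    return plan

on_device_change("@alice:hs1", ["!busy:hs1", "!quiet:hs1"])
plan = federation_send_plan(
    {"!busy:hs1": ["hs2", "hs3"], "!quiet:hs1": ["hs4"]},
    {"!busy:hs1": 200, "!quiet:hs1": 10},
)
assert plan[0][1] == "!busy:hs1"   # busy room handled first
```

The total work is the same as before, but the expensive room-to-host expansion happens after /login has returned, and the sender controls the ordering.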

@ara4n
Member

ara4n commented Mar 19, 2022

I literally can't log in any more thanks to this. It would be very much appreciated if this could be prioritised...

@ara4n ara4n changed the title Performance of device list updates could be improved Performance of device list updates could be improved (aka /login is unusably slow) Mar 19, 2022
erikjohnston added a commit that referenced this issue Apr 4, 2022
This is a first step in dealing with #7721.

The idea is basically that rather than calculating the full set of users a device list update needs to be sent to up front, we instead simply record the rooms the user was in at the time of the change. This will allow a few things:

1. we can defer calculating the set of remote servers that need to be poked about the change; and
2. during `/sync` and `/keys/changes` we can also avoid calculating users who share rooms with other users, and instead just look at the rooms that have changed.

However, care needs to be taken to correctly handle server downgrades. As such this PR writes to both `device_lists_changes_in_room` and the `device_lists_outbound_pokes` table synchronously. In a future release we can then bump the database schema compat version to `69` and then we can assume that the new `device_lists_changes_in_room` exists and is handled.

There is a temporary option to disable writing to `device_lists_outbound_pokes` synchronously, allowing us to test the new code path does work (and by implication upgrading to a future release and downgrading to this one will work correctly).

Note: Ideally we'd do the calculation of room to servers on a worker (e.g. the background worker), but currently only master can write to the `device_list_outbound_pokes` table.
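The dual-write strategy from the commit message can be sketched like this: write both the old per-destination table and the new per-room table until the minimum schema version can be bumped. The table names match the commit message, but the code itself is an illustrative stand-in, not Synapse's storage layer.

```python
# Sketch of the backwards-compatible dual write: populate both the
# old per-destination table and the new per-room table, so an older
# release (which only reads the old table) still works after a
# downgrade.

WRITE_OLD_TABLE = True   # the temporary option mentioned above

device_lists_outbound_pokes = []     # old: one row per destination
device_lists_changes_in_room = []    # new: one row per room

def record_device_change(user_id, device_id, rooms, destinations):
    for room_id in rooms:
        device_lists_changes_in_room.append((user_id, device_id, room_id))
    if WRITE_OLD_TABLE:
        # Kept until the minimum schema version is bumped, so a
        # downgraded server still sees the outbound pokes.
        for dest in destinations:
            device_lists_outbound_pokes.append((user_id, device_id, dest))

record_device_change("@a:hs1", "DEV", ["!r1:hs1"], ["hs2", "hs3"])
assert len(device_lists_changes_in_room) == 1
assert len(device_lists_outbound_pokes) == 2
```

Setting `WRITE_OLD_TABLE = False` corresponds to the temporary option for testing that the new code path works on its own.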
@erikjohnston
Member

1.58.0rc1 should now include the fix described in #7721 (comment), so I think this is fixed from a user perspective. Logging in on a large account still takes a lot of CPU and DB though, so we may want to investigate that, but for now I think we can close this issue.

@chagai95
Contributor

chagai95 commented Jun 6, 2022

I still cannot log in with an account that's in many rooms. I have 8 GB RAM and 8 cores; is there anything in the logs that I can keep an eye out for? I tried playing around with get_users_who_share_room_with_user and get_users_in_room but had no success. Anything else I'm missing, perhaps?

7 participants