Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

/_matrix/federation/v1/state_ids/ is heavy #7893

Closed
richvdh opened this issue Jul 18, 2020 · 12 comments
Closed

/_matrix/federation/v1/state_ids/ is heavy #7893

richvdh opened this issue Jul 18, 2020 · 12 comments
Labels
A-Federation A-Performance Performance, both client-facing and admin-facing

Comments

@richvdh
Copy link
Member

richvdh commented Jul 18, 2020

in a large room with long auth chains, each request to /_matrix/federation/v1/state_ids/ takes a lot of database time. The requesting servers tend to time out the requests, and then retry, compounding the problem.

@richvdh
Copy link
Member Author

richvdh commented Jul 18, 2020

see #6597: it's not obvious that this data is even useful.

@anoadragon453 anoadragon453 added p1 A-Performance Performance, both client-facing and admin-facing labels Jul 20, 2020
@anoadragon453
Copy link
Member

It seems now that we do want to keep this behaviour, but optimize the performance of it as per #6597 (comment)

@anoadragon453
Copy link
Member

@richvdh Do you think it worth keeping this issue open if we have #6597?

@richvdh
Copy link
Member Author

richvdh commented Jul 21, 2020

sorry, yes. This issue is about the database/CPU time taken on the server side; I've actually got a couple of patches which I need to PR to improve it.

#6597 is about handling the response on the client side.

richvdh added a commit that referenced this issue Jul 21, 2020
If we send out an event which refers to `prev_events` which other servers in
the federation are missing, then (after a round or two of backfill attempts),
they will end up asking us for `/state_ids` at a particular point in the DAG.

As per #7893, this is quite
expensive, and we tend to see lots of very similar requests around the same
time.

We can therefore handle this much more efficiently by using a cache, which (a)
ensures that if we see the same request from multiple servers (or even the same
server, multiple times), then they share the result, and (b) any other servers
that miss the initial excitement can also benefit from the work.

[It's interesting to note that `/state` has a cache for exactly this
reason. `/state` is now essentially unused and replaced with `/state_ids`, but
evidently when we replaced it we forgot to add a cache to the new endpoint.]
richvdh added a commit that referenced this issue Jul 23, 2020
If we send out an event which refers to `prev_events` which other servers in
the federation are missing, then (after a round or two of backfill attempts),
they will end up asking us for `/state_ids` at a particular point in the DAG.

As per #7893, this is quite
expensive, and we tend to see lots of very similar requests around the same
time.

We can therefore handle this much more efficiently by using a cache, which (a)
ensures that if we see the same request from multiple servers (or even the same
server, multiple times), then they share the result, and (b) any other servers
that miss the initial excitement can also benefit from the work.

[It's interesting to note that `/state` has a cache for exactly this
reason. `/state` is now essentially unused and replaced with `/state_ids`, but
evidently when we replaced it we forgot to add a cache to the new endpoint.]
@richvdh
Copy link
Member Author

richvdh commented Jul 29, 2020

Hopefully this is improved by #7931 and #7930.

@richvdh richvdh closed this as completed Jul 29, 2020
@richvdh
Copy link
Member Author

richvdh commented Aug 28, 2020

this is still a problem; matrix.org's federation readers are still pinning the CPU and hammering the database regularly.

@richvdh
Copy link
Member Author

richvdh commented Sep 28, 2020

This is exacerbated by rooms which have complex graphs. A single back-pagination request in !YynUnYHpqlHuoTAjsp:matrix.org on matrix.org introduced 51 new backwards-extremities, each of which required a state_ids request to kde.org, each of which takes 3 minutes of database time.

@richvdh
Copy link
Member Author

richvdh commented Sep 28, 2020

(the above also highlights the fact that state_ids is returning way too much data: there will be a lot of duplication in the auth chains of those 51 backwards-extremities, which suggests we should spec an updated state_ids that returns less data.)

@richvdh
Copy link
Member Author

richvdh commented Sep 28, 2020

we should spec an updated state_ids that returns less data

or follow construct's example, and send the request with an auth_chain_ids=0 query param, to indicate we don't need any of the auth chain ids at all.

@richvdh
Copy link
Member Author

richvdh commented Sep 29, 2020

@erikjohnston also suggests adding a cache to the middle of get_auth_chain_ids and caching intermediate results.

@richvdh
Copy link
Member Author

richvdh commented Feb 24, 2021

#9482 might help with this. I still feel like a lot of the problem here is that we tend to return a huge amount of redundant information, since the requesting server will have most of the auth_events already.

@callahad
Copy link
Contributor

The comment above stated:

This issue is about the database/CPU time taken on the server side

Screenshot of CPU utilization on matrix.org

#9482 improved CPU and DB utilization by an order of magnitude, and is shipping in 1.30.0rc1. Closing as resolved. 🎉

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Federation A-Performance Performance, both client-facing and admin-facing
Projects
None yet
Development

No branches or pull requests

4 participants