This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

get_auth_chain_difference is super expensive #8622

Closed
richvdh opened this issue Oct 21, 2020 · 5 comments · Fixed by #8868
Labels
A-Performance Performance, both client-facing and admin-facing

Comments


richvdh commented Oct 21, 2020

The get_auth_chain_difference transaction can often take hundreds of seconds of database time (during which all other traffic to the room is queued). This is a large component of database load from state resolution.

Caching the results of this transaction wouldn't be useful as each result is typically used only once (to populate the cache of state-resolution results), but could we cache some of the inputs which could get reused across multiple transactions (for example, the list of auth event ids for each event)?
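For illustration, caching that reusable input might look something like the sketch below. The `EVENT_AUTH` dict and function name are hypothetical stand-ins, not Synapse's actual storage API; the key point is that an event's auth event ids are immutable, so a per-event cache is safe to share across many transactions.

```python
from functools import lru_cache

# Hypothetical in-memory stand-in for the event_auth table.
EVENT_AUTH = {
    "$member": ("$join_rules", "$create"),
    "$join_rules": ("$create",),
    "$create": (),
}

@lru_cache(maxsize=100_000)
def get_auth_event_ids(event_id: str) -> tuple:
    """Auth event ids for an event; cacheable because they never change."""
    return EVENT_AUTH.get(event_id, ())
```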

@anoadragon453 anoadragon453 added the A-Performance Performance, both client-facing and admin-facing label Oct 22, 2020
@erikjohnston

Another thought here is that we may be able to change the database schema to make it more efficient to query. I'm not sure what this looks like in practice, but could be something like splitting up the auth chains into larger blocks to make it easier to reduce the search space? Or representing the graph in a different format?

@erikjohnston

> but could we cache some of the inputs which could get reused across multiple transactions (for example, the list of auth event ids for each event)?

This was done in #8672


erikjohnston commented Nov 18, 2020

I've been thinking and experimenting a bit with this, and so I'm going to jot down a few notes here.

Having found a couple of examples of really slow calls to get_auth_chain_difference, the problem appears to be when you have long chains of e.g. membership events for a single user. This results in the algorithm spending a long time just walking that auth chain.

The current algorithm works by walking the auth event graph in depth order, keeping track of which events are reachable from the different state sets. Looking at the implementation and some sample runs, I don't see any obvious dramatic performance improvements, and so we will likely need to make changes to the algorithm to get the necessary improvements.
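As a rough in-memory sketch of what the current approach computes (simplified: the real implementation runs against the database and orders the walk by depth), the difference is the union of the state sets' auth chains minus their intersection:

```python
def auth_chain_difference(state_sets, auth_events):
    """Naive auth chain difference: events in some state sets' auth
    chains but not in all of them.

    state_sets: list of sets of event ids.
    auth_events: dict of event_id -> list of that event's auth event ids.
    """
    chains = []
    for state_set in state_sets:
        seen = set()
        frontier = list(state_set)
        # Walk the auth graph from every event in the state set.
        while frontier:
            event_id = frontier.pop()
            for auth_id in auth_events.get(event_id, ()):
                if auth_id not in seen:
                    seen.add(auth_id)
                    frontier.append(auth_id)
        chains.append(seen)
    # Difference = union of the auth chains minus their intersection.
    return set().union(*chains) - set.intersection(*chains)
```

Long linear runs of events (e.g. one user's membership history) make the `while frontier` loop above the bottleneck, which is what the indexing ideas below try to avoid.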

Having searched around a bit I can't see anything in the literature that is directly applicable here, though there is quite a lot about efficient reachability lookups that is relevant. What we're likely looking at is adding some form of "index" that will help us to more quickly compute the auth chain difference. Some of the key properties we're looking for here are:

  1. The index used isn't "too large" (i.e. sizes of O(n^2) are out).
  2. Supports "online" updates, i.e. adding new nodes doesn't invalidate entries in the index (assuming we don't add nodes with incoming edges). AFAICT there isn't much talk of online indices, though some algorithms do look like they can be adapted to support online updates.

http://www.cse.ust.hk/~dimitris/6311/L24-RI-Parchas.pdf has an interesting table comparing various algorithms and their performance vs index size properties, with rough overviews of each. It's worth noting that a lot of the performance characteristics depend on the "shape" of the graph, e.g. how sparse it is.

My current avenue of attack is to note that the transitive reduction of the auth chain graphs appears to mostly be very "spiky", i.e. we have long streaks of events which only have a single incoming and a single outgoing edge. This naively feels like something that should be easy to index along the lines of "this event is part of the streak that starts at X and ends at Y", which would allow efficiently skipping to the end (rather than iteratively walking down the chain).

In fact, this is very similar to the idea of a "chain cover" of a graph, where you partition the graph into "chains", like so (taken from the PDF above):

[image: example chain cover partitioning a DAG into chains, from the PDF above]

By labelling the chains and adding sequence numbers you can:

  1. Quickly tell how two nodes in a chain relate by comparing their sequence numbers.
  2. Index on (chain_id, sequence number) to quickly pull out all events between two events in a chain.
  3. Store where chains point to each other as mappings of (chain_id, sequence number) -> (chain_id, sequence number), which lets us do a naive graph comparison on a much smaller graph.
  4. Alternatively, store the "transitive closure" of that chain graph, i.e. store (chain_id, sequence number) -> (chain_id, sequence number) for every path from one chain to another. Checking if an event (C1, S1) is reachable from another (C2, S2) is then a case of looking up whether there is a path from C1 -> C2 which originates at a sequence number reachable from S1 and terminates at a sequence number that can reach S2.
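A reachability check against the index from 4 might be sketched as follows (the `links` representation is mine, not Synapse's; it assumes cross-chain links are stored as a transitive closure, so one lookup suffices):

```python
def reachable(ev1, ev2, links):
    """True if ev2 is in ev1's auth chain.

    Events are labelled (chain_id, seq), higher seq meaning later in
    the chain. links[(c1, c2)] is a list of (s, t) pairs meaning the
    event at (c1, s) has (c2, t) in its auth chain.
    """
    c1, s1 = ev1
    c2, s2 = ev2
    if c1 == c2:
        # Within a single chain, earlier events are always reachable.
        return s2 <= s1
    # Cross-chain: need a link originating at or below our sequence
    # number that lands at or above the target's sequence number.
    return any(s <= s1 and s2 <= t for s, t in links.get((c1, c2), ()))
```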

The above can be used to generate the auth chain difference by computing, for each state set, the maximum sequence number in each chain that is reachable by that state set. The auth chain difference is then all events in each chain between the minimum and maximum of those sequence numbers. This is easy in a world where we have an index as described in 4 above, but is a bit harder if we only have the index from 3. It's not clear to me what the practical cost of 3 vs 4 would be right now.
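Once the per-chain maxima are known, the final difference computation is tiny. A sketch (assuming 0 encodes "chain not reachable from this state set"):

```python
def chain_difference(reachable_maxes):
    """Auth chain difference from per-chain reachability maxima.

    reachable_maxes: dict of chain_id -> list with one entry per state
    set: the max sequence number in that chain reachable from the
    state set (0 if the chain is not reachable at all).

    Everything up to the smallest max is reachable by all state sets
    (the intersection); everything up to the largest max is reachable
    by some (the union); the difference is the gap between the two.
    """
    result = {}
    for chain_id, maxes in reachable_maxes.items():
        lo, hi = min(maxes), max(maxes)
        if hi > lo:
            result[chain_id] = range(lo + 1, hi + 1)
    return result
```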

A simple way of computing a chain cover "online" would be to assign chains for each type/state key pair (though noting we may need multiple chains for a type/state key if there is a fork, but that should hopefully be rare). While it's impossible to generate the most efficient chain cover "online" (since it depends on the shape of the graph, which can theoretically change drastically as we get new events), we do know that long chains in the auth graph will most likely come from events of the same type/state key, and so basing the chains off type/state key should hopefully give us a decent chain covering. (Though we may be able to refine that a bit).
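A minimal sketch of that online allocation (hypothetical class, not actual Synapse code; the fork case, where one type/state key needs several chains, is ignored here):

```python
class ChainAllocator:
    """Assign (chain_id, sequence number) labels to new state events,
    one chain per (type, state_key) pair."""

    def __init__(self):
        self._next_chain_id = 1
        self._chains = {}  # (type, state_key) -> (chain_id, last_seq)

    def allocate(self, etype, state_key):
        key = (etype, state_key)
        if key not in self._chains:
            # First event for this pair starts a brand new chain.
            self._chains[key] = (self._next_chain_id, 0)
            self._next_chain_id += 1
        chain_id, seq = self._chains[key]
        self._chains[key] = (chain_id, seq + 1)
        return chain_id, seq + 1
```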

Building up the reachability index when inserting a new state event should be a case of:

  1. Fetch auth events, dropping those reachable by each other (e.g. if you have two joins J1 and J2 both pointing at P1 then we ignore P1). This removes redundant edges to bring the graph closer to its transitive reduction (we could also use the existing reachability index to compute the actual transitive reduction, but this may or may not be worth it).
  2. If there is only one auth event left and it's in the same chain, stop.
  3. Otherwise, for each auth event not in the same chain get their chain/sequence number and check if that is reachable by the new event's chain already. If not then we need to update the index (what we insert depends on the type of index being used).
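Those three steps might be sketched as below (all names hypothetical: `reachable` and `add_link` stand in for queries against, and updates to, the existing index):

```python
def add_event_links(event, auth_events, reachable, add_link):
    """Update the reachability index for a new state event.

    event and the entries of auth_events are (chain_id, seq) labels;
    reachable(a, b) answers whether b is in a's auth chain, and
    add_link(origin, target) records a new cross-chain link.
    """
    # 1. Drop auth events already reachable from another auth event
    #    (redundant edges).
    reduced = [
        a for a in auth_events
        if not any(b != a and reachable(b, a) for b in auth_events)
    ]
    # 2. A single remaining auth event in our own chain needs no links.
    if len(reduced) == 1 and reduced[0][0] == event[0]:
        return
    # 3. Link to auth events in other chains that our chain can't
    #    already reach.
    for chain_id, seq in reduced:
        if chain_id != event[0] and not reachable(event, (chain_id, seq)):
            add_link(event, (chain_id, seq))
```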

@erikjohnston

An implementation that uses the index from 4 takes ~100ms on my laptop with postgres, compared to ~1s with the old algorithm: a ~10x improvement.

The index tables combined are about a third of the size of event_auth, which feels acceptable.

@erikjohnston

The next step here is to figure out how to handle existing rooms that don't have the index. A naive solution would be to default to the old algorithm for existing rooms, but that would suck for existing servers and rooms.
