New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

State Resolution: Reloaded MSC #1441

Merged
merged 8 commits into from Aug 30, 2018

Conversation

Projects
5 participants
@erikjohnston
Copy link
Member

erikjohnston commented Jul 20, 2018

Rendered

Can I encourage people to do inline comments for formatting/wording issues, and thoughts and replies about the ideas should be full blown comments.

Matrix Room: https://matrix.to/#/#msc-1442:jki.re

@ara4n

This comment has been minimized.

Copy link
Member

ara4n commented Jul 20, 2018

(shouldn't the file be named 1442-state-resolution.md to avoid us going mad?)

@ara4n ara4n added the proposal-pr label Jul 20, 2018

@uhoreg

This comment has been minimized.

Copy link
Member

uhoreg commented Jul 26, 2018

I'm trying to understand why the algorithm needs to include the auth difference. It seems to make sense to do so, but I'm having a hard time coming up with an explanation for why.

@erikjohnston

This comment has been minimized.

Copy link
Member Author

erikjohnston commented Jul 26, 2018

@uhoreg I'm trying to understand why the algorithm needs to include the auth difference. It seems to make sense to do so, but I'm having a hard time coming up with an explanation for why.

It's to solve the problem where you have e.g. chains of power level events, like Alice gives Bob power (A), then Bob gives Charlie power (B) and then Charlie, say, changes the ban level (C). If you try and resolve two state sets one of which has A and the other has C, C will never pass auth since Charlie doesn't have power in A. If you pull in the auth difference (B) then you get A -> B -> C, which does pass auth.

@ara4n

This comment has been minimized.

Copy link
Member

ara4n commented Jul 26, 2018

^ this is the first time i've understood that; it would be amazing to get it into the MSC

@uhoreg

This comment has been minimized.

Copy link
Member

uhoreg commented Jul 26, 2018

Aha, that makes sense. So only events in the auth difference can be in that situation.

I think it would also be helpful to add a note to the MSC about why kicks are included as power events, but leaves are not.

@uhoreg

This comment has been minimized.

Copy link
Member

uhoreg commented Jul 28, 2018

I'm trying to understand the differences between the old state resolution algorithm and the new one. Some differences are pretty clear, and it looks like many security issues are addressed directly. One thing that I'm having troubles working out, though, is how it deals with the Hotel California problem. AFAIK, the Hotel California problem doesn't involve any power events, so the join and leave events would share the same closest mainline event in the new algorithm, so it looks like the only difference between the old and new algorithms is that the old algorithm orders them by depth and sha1(eventId), while the new algorithm orders them by timestamp and eventId, which doesn't seem like a significant change. Can you explain/give an example of how the new algorithm fixes the Hotel California problem?

@uhoreg

This comment has been minimized.

Copy link
Member

uhoreg commented Aug 1, 2018

The description of the reverse topological power ordering seems to allow for cycles. For example, given three PL events:
A = {auth_events=[(none of B or C)], sender_power_level=75}
B = {auth_events=[(none of A or C)], sender_power_level=50}
C = {auth_events=[B and others (but not A)], sender_power_level=100}

Since B is in the auth chain of C, B < C. Since A and B are not in each other's auth chains, we compare power levels; since A's sender's power level is greater than B's sender's power level, A < B. Since A and C are not in each other's auth chains, we compare power levels; since C's sender's power level is greater than A's power level, C < A. So we have B < C < A < B, which is not a proper order.

Looking at the networkx implementation, it looks like their lexicographical topological sort is an application of Kahn's algorithm with ties broken lexicographically. It may be better to define the reverse topological power ordering in terms of that, rather than trying to define it based on a "<" operator, which I think would end up being more clumsy.

@erikjohnston

This comment has been minimized.

Copy link
Member Author

erikjohnston commented Aug 1, 2018

The description of the reverse topological power ordering seems to allow for cycles

@uhoreg yeah, you're right. I think losing it talking about lexicographic sorting was lost in one of the many edits, I'll add it back, thanks.

@uhoreg

This comment has been minimized.

Copy link
Member

uhoreg commented Aug 1, 2018

I think that the algorithm gives the topological sort that is the smallest lexicographically among all the topological sorts. I'm not sure if defining it like that would be more or less confusing.

@erikjohnston

This comment has been minimized.

Copy link
Member Author

erikjohnston commented Aug 2, 2018

Thanks for the clarifying questions folks. It doesn't appear that there are many suggestions/objections to the proposed algorithm, so I am going to propose this MSC enters FCP as is.

There is still a question over whether we should change the auth rules to help make algorithm more resilient, but I think that should be done as a separate MSC.

@erikjohnston

This comment has been minimized.

Copy link
Member Author

erikjohnston commented Aug 2, 2018

I vote that this MSC should enter the final comment period, with the view of merging. Can the rest of the Spec Core Team either approve going into FCP (by ticking the box below) or comment with any concerns.

(Those outside the team can also comment with any new suggestions/concerns)

@turt2live

This comment has been minimized.

Copy link
Member

turt2live commented Aug 2, 2018

Having read it over again, my head suitably hurts. I'm not seeing any changes to describe how Hotel California is fixed though (mentioned in #1441 (comment)). Would be nice to have it addressed, as it's one of the larger pain points for the current algorithm, at least from a user perspective.

@ara4n

This comment has been minimized.

Copy link
Member

ara4n commented Aug 2, 2018

i can’t give a thunbs up for FCP until the questions about the hotel california failure mode are resolved

@uhoreg

This comment has been minimized.

Copy link
Member

uhoreg commented Aug 2, 2018

  1. I think the reverse topological power ordering is still a bit unclear. By only saying that ties are broken using the given ordering, I'm not sure if that's enough to unambiguously define how the events in my comment above should be ordered. I think that saying that we use the lexicographically smallest topological sort (maybe defining what exactly that means) might be the clearest thing to do, of course verifying that it's what networkx actually does.
  2. It would be nice to get an answer to #1441 (comment)
  3. In general, I'm wondering if things are defined precisely enough for other people to create their own independent implementation, so I'm (slowly) writing my own implementation to work through all the details. So far, other than the things I've already mentioned, I've only come across relatively minor things, such as, when sorting, specify whether the result is in ascending or descending order. Overall, I'd like to see the final spec to have more details (and I'll come up with a list of things that I think are missing later on), but I don't think that there's anything that would hold up the proposal or the FCP.
  4. In terms of efficiency, I think that there's some opportunity for optimization due to the fact that if one event is in every auth chain, then everything in its auth chain will also be in every auth chain, and so none of them will be in the auth difference. So you may not need to chase the auth events all the way to the first PL. It looks like it may be a bit complicated to do it right, but it could potentially be helpful in rooms with lots of PL changes.

(# 2 is the only blocker for me, with the assumption that 1 and 3 are easily resolvable)

@erikjohnston

This comment has been minimized.

Copy link
Member Author

erikjohnston commented Aug 3, 2018

I've tried to address both the Hotel California issue and expanded on the reverse topological power ordering.

@uhoreg I'm happy to clarify the rules if and when you find ambiguity.

@turt2live

This comment has been minimized.

Copy link
Member

turt2live commented Aug 3, 2018

Have given my approval for the FCP. Thanks for adding a mention for Hotel California :)

@anoadragon453

This comment has been minimized.

Copy link
Member

anoadragon453 commented Aug 3, 2018

How do we check that a server is not lying about the time? I could see disregarding something that is outrageously set in the future, but what about subtle manipulations?

I assume this is accounted for as we only use origin_server_ts in the most dire of scenarios.

@ara4n

This comment has been minimized.

Copy link
Member

ara4n commented Aug 5, 2018

How do we check that a server is not lying about the time?

As you say, we only use timestamps as a tiebreaker if there is no ordering implied by the auth DAG. Unfortunately this is the common case, so we're basically trusting servers not to lie about the time, and if they do, then we mitigate any damage by injecting auth events. It'd obviously be cleaner to consider the actual ordering of the room's state DAG, but that would mean having the full state DAG available - and the conclusion is that for today's use cases at least it's not the end of the world if an op on a malicious server goes and sets a topic from the year 3000 if that can then be promptly overridden via an auth event like de-opping the cranky op. It's not great for the scenario of using Matrix as a general purpose decentralised byzantine fault tolerant key-value database, but I'm expecting we'll see some more experimentation in state resolution algorithms in future, and in the interim MSC1442 is a big enough improvement on what we have today for today's use cases that we should go with it.

On which note, i'm giving a 👍 - @erikjohnston, thanks for spelling out the hotel california stuff.

@turt2live turt2live added this to In review in August 2018 r0 Aug 14, 2018

@uhoreg

This comment has been minimized.

Copy link
Member

uhoreg commented Aug 18, 2018

FWIW, I cleaned up my (partial) implementation a bit and put it up, for people who are weird like me: https://matrix.uhoreg.ca/stateres/reloaded.html I don't know how helpful it will be (in particular, I don't know if this will be more or less confusing for people), but it should at least give a bit of a different look at some things.

I probably won't spend much more time on it, as I think I've implemented enough to understand it sufficiently, but I do plan on eventually finishing it off, though it might be a while.

@turt2live

This comment has been minimized.

Copy link
Member

turt2live commented Aug 30, 2018

We now have a spec PR for this over at #1610. Feel free to comment on #1610 with any major objections.

@turt2live turt2live merged commit 2f55dd6 into master Aug 30, 2018

5 checks passed

ci/circleci: build-docs Your tests passed on CircleCI!
Details
ci/circleci: build-swagger Your tests passed on CircleCI!
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details
docs Click details to preview the HTML documentation.
Details
swagger Click to preview the swagger build.
Details

August 2018 r0 automation moved this from In review (just the PRs) to Done (this list will be incomplete) Aug 30, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment