Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

State resets are hard to recover from #412

Open
kegsay opened this issue Mar 18, 2024 · 3 comments
Open

State resets are hard to recover from #412

kegsay opened this issue Mar 18, 2024 · 3 comments
Labels
bug Something isn't working proxy-only Proxy limitation

Comments

@kegsay
Copy link
Member

kegsay commented Mar 18, 2024

This issue is using the term "state reset" to refer to the situation where Synapse recalculates room state incorrectly and sends very old room state down /sync.

The proxy is a Matrix client like any other client (e.g Element-Web). This means it is vulnerable to state resets like any other client. Unlike clients however (which can just "clear cache and reload") there is no mechanism to recover from a state resetted room.

This can manifest as rooms spontaneously appearing/disappearing, based on historical state. This is made worse because you can't just do an initial sync on the affected clients and have it self-heal because this is the typical failure mode:

  • The proxy is told the correct state
  • The proxy is told incorrect old state. The proxy doesn't know it's old as it's not in its DB and treats this as new.
  • An initial sync which returns this room returns the correct state.
  • The proxy ignores the correct state because it has already seen it, and has seen it was superceded by the old state. This could be handled better if MSC4033 was a thing (since the proxy could see the old state has a lower ordering and hence refuse to update the current state).

A native implementation would not have this problem because it does not rely on /sync v2's state calculations.

@kegsay kegsay added bug Something isn't working proxy-only Proxy limitation labels Mar 18, 2024
@kegsay
Copy link
Member Author

kegsay commented Mar 18, 2024

This has been encountered in the wild by @frebib:nerdhouse.io in #synapse:matrix.org, which amusingly was also the room referred to in element-hq/synapse#8629 - in this case it manifested as as disappearing room.

@kegsay
Copy link
Member Author

kegsay commented May 21, 2024

This can cause UTDs if it happens in an E2EE room.

@kegsay
Copy link
Member Author

kegsay commented May 23, 2024

This is ultimately a synapse bug, but the proxy may be able to mitigate against the worst of this. There's a few heuristics that can be applied:

  • when replacing room state, check origin_server_ts and assert that the old event has a lower timestamp and the new event has a higher timestamp. Allow a clock skew buffer e.g 24h.
  • If the "new" state has an older origin_server_ts (>24h) then flag this room as potentially state reset, along with the number of state events affected. Yell about it in Sentry.

For potentially state reset rooms:

  • determine the true current state for member events:
    • If the amount of potentially reset events < N, hit /state/m.room.member/$user_id
    • If the amount of potentially reset events >=N, hit /joined_members.
  • If the "new" membership doesn't match the membership returned here, then the "new" membership is a state reset, so drop it.
  • If the "new" membership does match the membership here, then it's a genuine event from a server with a very bad clock skew, allow it.

This means there's two tasks:

  • Detect when state resets happen and yell about it.
  • Accept/reject state based on additional HS queries.

I've mentioned a detection threshold of 24h. If this is too high then we won't catch all state resets. If this is too low then we'll catch clock skewed HSes and cause additional traffic on the HS to re-query state. We should begin tracking every state update that breaks temporal causality (that is, the update has a lower timestamp than the state being replaced), so we can monitor what value would be appropriate.

We need to protect against race conditions on the /members queries, so any potential state reset will need to cause the poller to temporarily stop polling to reconcile the true state.

Finally, we need to ensure that a persistently clock skewed HS (malicious or not) cannot cause a DoS on the proxy. This may involve dumping updates for that room into a temporary holding area. A true state reset will be sent to all pollers, meaning we MUST de-duplicate the work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working proxy-only Proxy limitation
Projects
None yet
Development

No branches or pull requests

1 participant