State resets are hard to recover from #412

kegsay · 2024-03-18T17:05:55Z

This issue is using the term "state reset" to refer to the situation where Synapse recalculates room state incorrectly and sends very old room state down /sync.

The proxy is a Matrix client like any other client (e.g Element-Web). This means it is vulnerable to state resets like any other client. Unlike clients however (which can just "clear cache and reload") there is no mechanism to recover from a state resetted room.

This can manifest as rooms spontaneously appearing/disappearing, based on historical state. This is made worse because you can't just do an initial sync on the affected clients and have it self-heal because this is the typical failure mode:

The proxy is told the correct state
The proxy is told incorrect old state. The proxy doesn't know it's old as it's not in its DB and treats this as new.
An initial sync which returns this room returns the correct state.
The proxy ignores the correct state because it has already seen it, and has seen it was superceded by the old state. This could be handled better if MSC4033 was a thing (since the proxy could see the old state has a lower ordering and hence refuse to update the current state).

A native implementation would not have this problem because it does not rely on /sync v2's state calculations.

The text was updated successfully, but these errors were encountered:

kegsay · 2024-03-18T17:07:48Z

This has been encountered in the wild by @frebib:nerdhouse.io in #synapse:matrix.org, which amusingly was also the room referred to in element-hq/synapse#8629 - in this case it manifested as as disappearing room.

kegsay · 2024-05-21T08:30:31Z

This can cause UTDs if it happens in an E2EE room.

kegsay · 2024-05-23T10:43:44Z

This is ultimately a synapse bug, but the proxy may be able to mitigate against the worst of this. There's a few heuristics that can be applied:

when replacing room state, check origin_server_ts and assert that the old event has a lower timestamp and the new event has a higher timestamp. Allow a clock skew buffer e.g 24h.
If the "new" state has an older origin_server_ts (>24h) then flag this room as potentially state reset, along with the number of state events affected. Yell about it in Sentry.

For potentially state reset rooms:

determine the true current state for member events:
- If the amount of potentially reset events < N, hit /state/m.room.member/$user_id
- If the amount of potentially reset events >=N, hit /joined_members.
If the "new" membership doesn't match the membership returned here, then the "new" membership is a state reset, so drop it.
If the "new" membership does match the membership here, then it's a genuine event from a server with a very bad clock skew, allow it.

This means there's two tasks:

Detect when state resets happen and yell about it.
Accept/reject state based on additional HS queries.

I've mentioned a detection threshold of 24h. If this is too high then we won't catch all state resets. If this is too low then we'll catch clock skewed HSes and cause additional traffic on the HS to re-query state. We should begin tracking every state update that breaks temporal causality (that is, the update has a lower timestamp than the state being replaced), so we can monitor what value would be appropriate.

We need to protect against race conditions on the /members queries, so any potential state reset will need to cause the poller to temporarily stop polling to reconcile the true state.

Finally, we need to ensure that a persistently clock skewed HS (malicious or not) cannot cause a DoS on the proxy. This may involve dumping updates for that room into a temporary holding area. A true state reset will be sent to all pollers, meaning we MUST de-duplicate the work.

kegsay added bug Something isn't working proxy-only Proxy limitation labels Mar 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

State resets are hard to recover from #412

State resets are hard to recover from #412

kegsay commented Mar 18, 2024 •

edited

Loading

kegsay commented Mar 18, 2024

kegsay commented May 21, 2024

kegsay commented May 23, 2024

State resets are hard to recover from #412

State resets are hard to recover from #412

Comments

kegsay commented Mar 18, 2024 • edited Loading

kegsay commented Mar 18, 2024

kegsay commented May 21, 2024

kegsay commented May 23, 2024

kegsay commented Mar 18, 2024 •

edited

Loading