Room events arrive faster than to-device messages during federation connectivity problems, causing decryption failures #1123

Open
Tracked by #245
turt2live opened this issue Jun 14, 2022 · 10 comments
Labels
A-E2EE (Issues about end-to-end encryption), feature (Suggestion for a significant extension which needs considerable consideration)

Comments

@turt2live
Member

turt2live commented Jun 14, 2022

Suppose server A is struggling to send federation traffic to server B. (This might be because B is slow, is having connectivity problems, or there is just a huge queue of traffic.)

In this situation, B will often still receive encrypted messages from A: they may be pulled in from other servers in the room. However, the keys to these messages will be stuck in a queue on A. Users on B therefore see decryption errors.

ref element-hq/element-web#3754

@turt2live turt2live added wart A point where the protocol is inconsistent or inelegant A-E2EE Issues about end-to-end encryption labels Jun 14, 2022
@richvdh richvdh changed the title if your server goes down, when it comes back you get messages long before the keys Room events arrive faster than to-device messages during federation connectivity problems, causing decruption failures Jan 13, 2023
@richvdh richvdh changed the title Room events arrive faster than to-device messages during federation connectivity problems, causing decruption failures Room events arrive faster than to-device messages during federation connectivity problems, causing decryption failures Jan 13, 2023
@richvdh richvdh added feature Suggestion for a significant extension which needs considerable consideration and removed wart A point where the protocol is inconsistent or inelegant labels Jan 13, 2023
@MadLittleMods
Contributor

Related to #966 (potential duplicate)

@richvdh
Member

richvdh commented Apr 12, 2023

> Related to #966 (potential duplicate)

It's not a duplicate: #966 talks about the ordering of to-device messages relative to one another, while this is about to-device messages relative to room events.

@dkasak
Member

dkasak commented May 16, 2023

According to https://matrix-org.github.io/synapse/dev-docs/latest/modules/federation_sender.html#a-note-on-failures-and-back-offs, Synapse already mitigates this to some degree by restarting the to-device transmission loop once it receives an inbound request from the remote server.
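
As a rough illustration of that mitigation (a minimal sketch, not Synapse's actual code: the class name, method names, and back-off constants below are all assumptions), a per-destination queue could drop its back-off as soon as the remote server is seen making an inbound request:

```python
from dataclasses import dataclass, field


@dataclass
class DestinationQueue:
    """Hypothetical per-destination outbound queue (not Synapse's real class)."""

    destination: str
    pending_to_device: list = field(default_factory=list)
    backoff_until: float = 0.0     # time before which we will not retry
    backoff_interval: float = 0.0  # current back-off length, in seconds

    def record_failure(self, now: float) -> None:
        # Assumed policy: start at 10 minutes, double per failure, cap at 1 day.
        self.backoff_interval = min(max(600.0, self.backoff_interval * 2), 86400.0)
        self.backoff_until = now + self.backoff_interval

    def can_send(self, now: float) -> bool:
        return now >= self.backoff_until

    def on_inbound_request(self) -> None:
        # The remote server reached us, so it is probably reachable again:
        # clear the back-off so queued to-device messages are retried at once.
        self.backoff_interval = 0.0
        self.backoff_until = 0.0
```

This only helps when B can reach A at all; it does nothing for the case where traffic flows purely via a third server.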

@ara4n
Member

ara4n commented Oct 6, 2023

The core problem here is that todevice messages don't flow transitively, but timeline msgs do. So if server A can't route to server B, messages may all flow transitively A->C->B - but the keys will never get through, no matter what the retry schedule is.

Possible solutions off the top of my head:

  • Forbid transitive backfill for encrypted rooms
  • Let server B request missing todevice traffic from server A via server C. So:
    • server B realises it's receiving current events from server A via server C
    • server B assumes that A can't route to it directly
    • server B asks server C to ask server A for B's todevice msgs.

The latter is more fiddly, but preserves Matrix's somewhat desirable self-healing transitive delivery properties.
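
The detection half of the second option could be sketched roughly as follows. This is only an illustration under assumptions: the `RelayDetector` class, its threshold, and the idea of counting relayed events are hypothetical, and nothing like the "ask C to ask A" request exists in the federation API today.

```python
from collections import defaultdict


class RelayDetector:
    """Hypothetical bookkeeping on server B; names are illustrative only."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        # origin server -> {relaying server -> count of relayed events}
        self.relayed = defaultdict(lambda: defaultdict(int))

    def on_event(self, origin: str, received_from: str):
        """Record one event. Return (origin, relay) when B should ask
        `relay` to fetch B's to-device messages from `origin`, else None."""
        if received_from == origin:
            # Direct delivery works, so forget earlier suspicions about origin.
            self.relayed.pop(origin, None)
            return None
        self.relayed[origin][received_from] += 1
        if self.relayed[origin][received_from] >= self.threshold:
            # Reset the counter so we do not re-request on every later event.
            self.relayed[origin][received_from] = 0
            return (origin, received_from)
        return None
```

Here B would call `on_event("a.example", "c.example")` for each event from A that arrives via C, and act on the first non-`None` result.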

Alternatively, does MLS solve this by putting the key data in the timeline rather than todevice msgs? (cc @uhoreg)

@uhoreg
Member

uhoreg commented Oct 7, 2023

Most of the MLS stuff happens in-room. The one thing that can happen in to-device messages is Welcome messages, but those could also be sent in-room. We haven't figured out if it would be better to use to-device messages or in-room messages for that.

@dkasak
Member

dkasak commented Oct 8, 2023

Presumably Matrix could've gotten away with putting its key data into the room too, but for some reason chose not to. Are there any concerns with MLS key data being stored in the DAG forever? E.g. bloat or problems arising out of making permanent something that should be ephemeral?

@uhoreg
Member

uhoreg commented Oct 10, 2023

Yeah, the main concerns about storing the MLS Welcome messages in the DAG are bloat-related. It would be irrelevant information to most of the people in the room.

@uhoreg
Member

uhoreg commented Jan 19, 2024

We may be able to flag events that we got relayed from a different server, so that clients can indicate that there may be a decryption failure.
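
A minimal sketch of what such flagging might look like on the receiving server, assuming the flag lives under `unsigned`. The `org.example.relayed` key and the `flag_if_relayed` helper are made up for illustration and are not part of any spec or MSC:

```python
def flag_if_relayed(event: dict, received_from: str) -> dict:
    """Return the event, marked as relayed if it did not come from its origin.

    `org.example.relayed` is a hypothetical key, not a real Matrix field.
    """
    # The origin server is everything after the first colon of the sender's
    # user ID (the server name itself may contain a colon for a port).
    origin = event["sender"].split(":", 1)[1]
    if received_from == origin:
        return event
    flagged = dict(event)  # shallow copy; the original event is left untouched
    flagged["unsigned"] = {**event.get("unsigned", {}), "org.example.relayed": True}
    return flagged
```

A client seeing the flag on an undecryptable event could then show a "keys may be delayed" hint rather than a generic decryption error.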

@kegsay
Member

kegsay commented Jan 26, 2024

I think there are two separate issues here:

  • network partitions over federation, i.e. no amount of waiting will give you the to-device msgs.
  • temporary lag where to-device msgs take ages to arrive.

The latter is quite frequent, and we see this (almost) daily between element.io <--> matrix.org. If we renegotiated room keys less frequently (or had more time to let the room key arrive by not sending it when typing..) this would concretely help I think.

@uhoreg
Member

uhoreg commented Jan 31, 2024

> (or had more time to let the room key arrive by not sending it when typing..)

I don't understand what you mean by this. Sending it when typing means that we send it earlier, which means it has more time to arrive before the room event is received?

7 participants