MSC4081: Eagerly sharing fallback keys with federated servers #4081

kegsay · 2023-11-21T17:50:26Z

Rendered

proposals/4081-claim-fallback-keys-on-network-failure.md

uhoreg · 2023-11-24T16:51:44Z

proposals/4081-claim-fallback-keys-on-network-failure.md

+> Servers must send `m.device_list_update` EDUs to all the servers who share a room with a given local user, and
+> must be sent whenever that user’s device list changes (i.e. for new or deleted devices, when that user joins a
+> room which contains servers which are not already receiving updates for that user’s device list, or changes in
+> device information such as the device’s human-readable name **or fallback key**).


One potential issue that I can see with pre-sending fallback keys is something like this scenario:

1. Alice uploads fallback key A, which gets sent to Bob's server.
2. Bob's server goes down for a while.
3. Alice receives some Olm messages that use the fallback keys, so she rotates her keys, first to fallback B and then to fallback C. At this point, she has evicted private key A. Since Bob's server is down, it doesn't receive the new fallback keys.
4. Alice's server goes down for a while, and then Bob's server comes back up.
5. Bob tries to establish an Olm session with Alice, receives fallback A, and sends an encrypted message to Alice.
6. Alice's server comes back up, and Alice receives the message from Bob, but can't decrypt it since she doesn't have private key A any more.

This is worse than the situation where Bob doesn't get any OTK for Alice, since if he doesn't receive any key, he knows about the failure and can retry later. On the other hand, this is likely a very rare scenario, so may not be worth worrying about.
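The scenario above can be sketched as a toy simulation. This is only an illustration under stated assumptions: `FallbackKeyStore`, its `keep=2` retention (current key plus one previous key) and the string key IDs are hypothetical and are not taken from any client implementation.

```python
from collections import deque


class FallbackKeyStore:
    """Toy model of a client holding the newest `keep` fallback private keys."""

    def __init__(self, keep=2):
        # A bounded deque silently evicts the oldest key when a new one is added.
        self._private_keys = deque(maxlen=keep)

    def rotate(self, key_id):
        self._private_keys.append(key_id)

    def can_decrypt(self, key_id):
        return key_id in self._private_keys


alice = FallbackKeyStore(keep=2)
alice.rotate("A")
bob_server_cache = "A"  # Bob's server receives fallback key A, then goes down

alice.rotate("B")  # Alice now holds A and B
alice.rotate("C")  # Alice now holds B and C; private key A is evicted

# Bob's server comes back while Alice's is down, so its cache is never refreshed
# and Bob encrypts to the stale key.
stale_session_ok = alice.can_decrypt(bob_server_cache)
print(stale_session_ok)
```

Bob has no signal that anything went wrong here, which is what distinguishes this from the case where the key claim simply fails.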

This could be mitigated by either:

- Signalling to the sender that the session was undecryptable. Whilst this can be prompt, I would be worried about DoS and oracle-like attacks. DoS, because attackers now have a way to make clients send traffic automatically by sending keys the client doesn't have. Oracle, because an attacker can send various keys and learn whether the client still has the private key on disk, i.e. it exposes whether the key has been evicted or not, which is a critical part of forward secrecy.

- Treating fallback-initiated sessions as unreliable, so that if you do not get an established session after time N, you claim another OTK and try again. I'm unsure of the security implications here, as an attacker may be able to make use of the fact that there may be more than one established Olm session.

I think this scenario is: "what if Bob isn't told that Alice has rotated her fallback key, and tries to use a stale cached fallback key". This is very similar to the "what if Alice restores her server from backup, and starts handing out stale OTKs" failure mode.

I think we have to consider these as wedged sessions, and keep retrying setup from the client (with your server nudging you to retry by waking you up with a push, or similar, when the server sees that the remote server has updated its device list).

Agreed, this is basically the same as OTK reuse where the session is encrypted for a key the client has dropped. The frequency of these is quite different though.

Server backups will reliably cause OTK reuse currently, because we have no mechanism to tell clients that they need to reupload all their OTKs again (and even if we did, there would be a race condition where some user has claimed it during the reupload process). As a result, if during the bad deployment a user claimed 5 keys, upon rollback, the next 5 OTKs will be bad due to key reuse, guaranteed.
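As a rough illustration of why a rollback guarantees reuse, here is a minimal sketch. `ToyServer` and the key names are hypothetical and model only an ordered pool of unclaimed OTKs, not any real homeserver's storage.

```python
import copy


class ToyServer:
    """Toy model of a server handing out one-time keys in upload order."""

    def __init__(self, otks):
        self.unclaimed = list(otks)

    def claim(self):
        return self.unclaimed.pop(0)


server = ToyServer([f"otk{i}" for i in range(10)])
backup = copy.deepcopy(server)  # snapshot taken before the bad deployment

claimed_before_rollback = [server.claim() for _ in range(5)]
server = copy.deepcopy(backup)  # rollback: the server forgets those 5 claims

claimed_after_rollback = [server.claim() for _ in range(5)]
reused = set(claimed_before_rollback) & set(claimed_after_rollback)
print(sorted(reused))
```

In this model every key claimed during the bad deployment is handed out a second time after the restore, so the client has already discarded the private half by the time the second claimant uses it.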

what if Bob isn't told that Alice has rotated her fallback key

For this to happen the two servers need to be partitioned for time N where N is the time between uploading a new fallback key and deleting the old fallback key on the client. N is configurable, and X3DH provides an example interval of "once a week, or once a month". Looking at the Android source it seems they try to cycle it every 2 days, with a max time of 14 days. This feels like a long enough time for most ephemeral network partitions to resolve themselves. Therefore, the likelihood of actually seeing this is much more remote.

N is configurable, and X3DH provides an example interval of "once a week, or once a month". Looking at the Android source it seems they try to cycle it every 2 days, with a max time of 14 days.

As discussed in the other thread, there is confusion here between "how often do we create a new key" and "how long do we keep old keys, having created a new one". The X3DH spec just says "eventually" for the latter:

After uploading a new signed prekey, Bob may keep the private key corresponding to the previous signed prekey around for some period of time, to handle messages using it that have been delayed in transit. Eventually, Bob should delete this private key for forward secrecy.

I think the Signal Android app is using 30 days for this period (https://github.com/signalapp/Signal-Android/blob/940cee0f30d6a2873ae08c65bb821c34302ccf5d/app/src/main/java/org/thoughtcrime/securesms/crypto/PreKeyUtil.java#L210-L239).

Nevertheless, I agree with the principles in this discussion: If Alice and Bob's servers manage to overlap their downness for long enough, then yes Bob will be unable to message Alice. But that's an extreme case, and I don't think it should prevent us making this incremental improvement in reliability even if we still end up bolting on retries later on.

Not quite sure what to do with this thread. Maybe I should add a section to the MSC to call out the issue?

I suspect that the issue should be rare enough that it's not worth trying to solve it right now. So at most, add something in the MSC that says something about it.

uhoreg · 2023-11-24T17:00:40Z

proposals/4081-claim-fallback-keys-on-network-failure.md

@@ -0,0 +1,139 @@
+# MSC4081: Claim fallback key on network failures


@ara4n's comment somehow ended up on the commit rather than on the PR, so copying here:

I'm a bit worried about this: it's (nominally) weakening security in order to work around network reliability issues. it reminds me of our misadventures in key gossiping, where we similarly weaken security to mainly work around bad retry mechanisms and network unreliability.

If our server can't talk to the other server, i wonder if we should warn the sender (e.g. a "can't contact bob.com!" warning on the message) and then retry? the sender will know to keep the app open while it retries (just as they would if they were stuck sending the message too)? This feels better than giving up and sending the message with (nominally) lower security, and could also make the app feel more responsive with appropriate UX (i.e. rather than being stuck in 'sending' state for ages while a /keys/claim times out, it could declare itself sent to 10 out of 11 servers, or similar).

ah, thanks for rescuing it - GH mobile app doing weird things. i was about to rewrite it from scratch.

Can we clarify to what extent it weakens security? The use of fallback keys is I think mitigated by the fact that the double ratchet will do a DH step on the next message and that will restore security.
Fallback keys exist so that communication does not break when all OTKs are exhausted (a convenience, and a mitigated one), so why can't they also be used for transient federation connectivity problems?
Maybe they could be modified to have a TTL?
And in the case of a replay attack (the same prekey message sent over and over), couldn't the client add some additional mitigations (as it could already, BTW), like detecting abusive use of the fallback key?

I'd like to re-emphasise the nominal reduction in security: in reality there is negligible impact, further reinforced by other secure protocols (Signal in this case) allowing OTKs to be optional in the setup phase. I think this MSC is overall net positive, as it makes the protocol more robust, and fixes concrete bugs we've seen in the wild.

I guess the important thing to emphasise in the threat model is that OTKs are security theatre once you introduce fallback keys, given that an attacker can force use of the fallback key either by exhausting the OTK pool (which leaves an audit trail) or by simply denying the network (which doesn't leave an audit trail).

So, it feels like the only reasons we end up keeping OTKs are:
a) To enjoy their (nominal) security properties for paranoid deployments which disable fallback keys
b) To keep exercising the OTK code path, even when fallback keys are around, to help stop it regressing for deployments where fallback keys are disabled.

In which case, yes, perhaps this MSC isn't as bad as it felt at first.

I guess the important thing to emphasise in the threat model is that OTKs are security theatre once you introduce fallback keys, given that an attacker can force use of the fallback key either by exhausting the OTK pool (which leaves an audit trail) or by simply denying the network (which doesn't leave an audit trail).

So, it feels like the only reasons we end up keeping OTKs are: a) To enjoy their (nominal) security properties for paranoid deployments which disable fallback keys b) To keep exercising the OTK code path, even when fallback keys are around, to help stop it regressing for deployments where fallback keys are disabled.

That's not quite right. As I see it, OTKs guard against a passive attacker, who has nevertheless managed to snarf the network data, and then later gets access to [the data on] Bob's device. You don't have to be paranoid and disable fallback keys to benefit from them. I've linked to https://crypto.stackexchange.com/a/52825 in the doc, as I think it really helps explain this.

So yes, an attacker with access to the network between homeservers can now force use of a fallback key where previously no communication would happen at all. But it's far easier to claim all the OTKs than it is to get access to the network to block that /claim request, so I'm not sure it's really moving the needle?
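To make the "claim all the OTKs" path concrete, here is a minimal sketch of how a key claim behaves once fallback keys exist: one-time keys are consumed on claim, while the fallback key is served repeatedly once the pool is empty. `DeviceKeys` and `claim_key` are hypothetical names for illustration, not a real server API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DeviceKeys:
    """Toy model of the keys a server holds for one device."""

    one_time_keys: list = field(default_factory=list)
    fallback_key: Optional[str] = None


def claim_key(device):
    # A one-time key is removed from the pool as soon as it is claimed.
    if device.one_time_keys:
        return device.one_time_keys.pop(0)
    # The fallback key is not consumed, so every later claim receives it.
    return device.fallback_key


bob = DeviceKeys(one_time_keys=["otk1", "otk2"], fallback_key="fallback1")
attacker_claims = [claim_key(bob) for _ in range(2)]  # exhausts the pool
alice_claim = claim_key(bob)
print(attacker_claims, alice_claim)
```

Exhausting the pool needs nothing more than ordinary claim requests, whereas blocking the claim request requires a position on the network path between the two homeservers.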

kegsay · 2024-02-02T12:24:42Z

Synapse will also need element-hq/synapse#16875 in addition to MSC4081 to allow clients to start Olm sessions when the server is down.

proposals/4081-claim-fallback-keys-on-network-failure.md

updated

ara4n · 2024-02-28T23:49:38Z

proposals/4081-claim-fallback-keys-on-network-failure.md

+significant delay between the old key being used to encrypt a message and that message being received at the
+recipient, and MSC2732's recommendation (the lesser of "as soon as the new key is used" and 1 hour) is inadequate
+We therefore recommend significantly increasing the period for which an old fallback key is kept on the client, to
+30 days after the key was replaced, but making sure that at least one old fallback key is kept at all


I think this means that after 30d of netsplit, a user on a server which has cached an old fallback key will no longer be able to establish an Olm session to you?

This feels like a pretty major limitation which should be called out - and communicated to the sending user when it happens?

I still like the idea of warning the user in general what users they can’t communicate with (due to no OTKs, or due to expired fallback keys), so the user can go and complain and get the problem solved.

I think it's worth saying that, after 30 days of netsplit, it's a good bet that any to-device messages you send aren't going to be delivered for a while anyway. (In other words: what good does obtaining a fallback key do if you then can't actually send anything you encrypt with it?)

Still, you're not wrong. It's also worth saying that this 30 days is entirely under the client's control; so if you are working in an environment where you expect your homeserver to go incommunicado for 3 months, perhaps you can configure your client to keep old fallback keys that long.
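A client-side retention policy along these lines might look like the following sketch. It assumes one reading of the quoted MSC text: old fallback keys are dropped 30 days after they were replaced, except that the most recently replaced key is always kept. `prune_old_fallback_keys`, `RETENTION` and the tuple layout are hypothetical.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=30)  # assumed default; entirely under the client's control


def prune_old_fallback_keys(old_keys, now, retention=RETENTION):
    """Return the old fallback keys to keep.

    old_keys is a list of (key_id, replaced_at) tuples for keys that have
    already been replaced by a newer fallback key.
    """
    if not old_keys:
        return []
    kept = [(key_id, at) for key_id, at in old_keys if now - at <= retention]
    if not kept:
        # Always keep at least one old key: the most recently replaced one.
        kept = [max(old_keys, key=lambda entry: entry[1])]
    return kept


now = datetime(2024, 3, 1)
key_a = ("A", now - timedelta(days=45))
key_b = ("B", now - timedelta(days=10))
print(prune_old_fallback_keys([key_a, key_b], now))
print(prune_old_fallback_keys([key_a], now))
```

A deployment expecting long outages could pass a larger `retention`, for example `timedelta(days=90)`, at the cost of keeping old private keys on disk for longer.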

As for detecting and reporting the situation to the user: yes, that might be nice, if only so that it can end up in a rageshake (I can hear @pmaier1's voice in my head saying "users don't want to be bothered with this sort of technical detail!"). But how can we detect it? Alice (who is using the fallback key) has no way of knowing that Bob has expired that key, because they are netsplit. Timestamping the keys doesn't help, because an old key could also mean that Bob hasn't used his client for 30 days. We could maybe detect the situation once Bob comes back online, but that doesn't help the user get the problem solved in the first place.

IMHO this starts to get into questions about "how long is it reasonable for clients/servers to be offline/unreachable, and still expect messages to get delivered". FB Messenger, for example, actually logs out any devices that are unused after 30 days. We don't necessarily need to go that far, but "forever" seems an unreasonable expectation; if we actually set some expectations then we could work towards having sensible UX when that time expires.

IMHO this starts to get into questions about "how long is it reasonable for clients/servers to be offline/unreachable, and still expect messages to get delivered". FB Messenger, for example, actually logs out any devices that are unused after 30 days. We don't necessarily need to go that far, but "forever" seems an unreasonable expectation; if we actually set some expectations then we could work towards having sensible UX when that time expires.

💯 - if we actually bothered to do this then maybe we could clear out the to_device_inbox at some point...

I disagree that "I still like the idea of warning the user in general what users they can’t communicate with (due to no OTKs, or due to expired fallback keys), so the user can go and complain and get the problem solved." is a useful property to be trying to preserve here. If I'm on a HS and it is down, then you cannot talk to me no matter what you try to do. In some cases there may be an alternative way of contacting me, to which I can either A) thank you for being a real-life PagerDuty and restart the server or B) shrug and say I don't actually control it, as it's on $homeserver I don't control.

proposals/4081-claim-fallback-keys-on-network-failure.md

kegsay · 2024-02-29T09:38:56Z

proposals/4081-claim-fallback-keys-on-network-failure.md


kegsay · 2024-02-29T09:45:07Z

proposals/4081-claim-fallback-keys-on-network-failure.md

+
+2. Clients could remember that they were unable to claim keys for a given device, and retry periodically. The main
+   problem with this approach (other than increased complexity in the client) is that it requires the sending
+   client to still be online when the remote server comes online, and to notice that has happened. There may be


It basically rules out asynchronous communication when initially talking to someone, as both HSes need to be online at exactly the same time. This feels very suboptimal and rules out the ability to run E2EE Matrix under certain network conditions.

both HSes need to be online at exactly the same time.

Whatever we do, both HSes need to be online at the same time because one has to make an HTTP request to the other. I guess you mean that the sending client, and both HSes, have to be online. In which case, yes I agree with you and this is a succinct summary of why this approach is insufficient.

Yes that's what I meant, sorry.

proposals/4081-claim-fallback-keys-on-network-failure.md

Co-authored-by: kegsay <7190048+kegsay@users.noreply.github.com>

kegsay · 2024-05-28T13:19:59Z

Whilst this MSC would definitely help some situations, we require element-hq/synapse#11374 for fully offline support (so device list updates are eagerly shared too).

kegsay · 2024-07-12T09:15:23Z

When I wrote this MSC originally, I did not think this would be a sufficiently frequent cause of UTDs. However, as we have fixed other UTD causes, this is now a significant chunk of the remaining decryption failures. As such, we should probably try to prototype this and land a solution at some point. See element-hq/synapse#17267 for real world causes which this MSC would fix.

MSC4081: Claim fallback key on network failure

ef28b25

uhoreg reviewed Nov 21, 2023

uhoreg added the proposal, needs-implementation, kind:feature, e2e, and s2s labels Nov 21, 2023

uhoreg reviewed Nov 21, 2023

kegsay added 2 commits November 22, 2023 10:33

Formatting; mention increased access to fallback key

770f660

Update 4081-claim-fallback-keys-on-network-failure.md

7e8a59d

richvdh reviewed Nov 22, 2023

richvdh reviewed Nov 22, 2023

richvdh reviewed Nov 22, 2023

richvdh reviewed Nov 22, 2023

kegsay mentioned this pull request Nov 22, 2023

A transient failure to establish an Olm session will cause forever undecryptable room messages matrix-org/matrix-rust-sdk#2864

Open

kegsay added 2 commits November 23, 2023 11:15

Clarity around when fallback keys count as used

9e9fa88

More clarity

84079ea

uhoreg reviewed Nov 24, 2023

kegsay mentioned this pull request Nov 27, 2023

Full stack E2EE Testing MEGAISSUE element-hq/element-meta#2165

Open

kegsay mentioned this pull request Jan 11, 2024

E2EE should recover rapidly from HS outages element-hq/element-meta#2153

Closed

richvdh mentioned this pull request Jan 12, 2024

Users whose servers were unreachable will receive undecryptable messages due to failed OTK claim element-hq/element-meta#2154

Open

richvdh previously requested changes Feb 27, 2024

richvdh and others added 2 commits February 28, 2024 18:00

Apply suggestions from code review

c32227e

address review comments

ea992d0

ara4n reviewed Feb 28, 2024

richvdh changed the title ~~MSC4081: Claim fallback key on network failure~~ MSC4081: Eagerly sharing fallback keys with federated servers Feb 29, 2024

kegsay commented Feb 29, 2024

richvdh reviewed Feb 29, 2024

richvdh and others added 2 commits February 29, 2024 13:16

Apply suggestions from code review

5262868

Co-authored-by: kegsay <7190048+kegsay@users.noreply.github.com>

Update 4081-claim-fallback-keys-on-network-failure.md

e8e5b85

kegsay mentioned this pull request Jun 7, 2024

Backoffs and 429s on /keys/claim over federation causes UTDs element-hq/synapse#17267

Open

Give examples of "unreachable" servers

9404052

BillCarsonFr mentioned this pull request Jul 2, 2024

MSC4162: One-Time Key Reset Endpoint #4162

Open

MarcWadai mentioned this pull request Jul 9, 2024

Can we have more info about why messages are undecrypted ? tchapgouv/tchap-web-v4#904

Open
