
OTK fetching failure woes (attempting to establish Olm sessions too often) #281

Closed
poljar opened this issue Jun 22, 2021 · 15 comments · Fixed by #1315 or #2805

@poljar
Contributor

poljar commented Jun 22, 2021

Sending a message in a room is a three step process:

  1. Collect all the devices that we don't have an established Olm session with and claim one-time keys for them
  2. Encrypt a room key for all the devices that are participating in the room and send the key via a to-device message
  3. Encrypt and send the message
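
As a rough sketch of how these three steps hang together in client code (the method names are only approximations of the matrix-sdk-crypto API, and send_to_homeserver / send_room_event are hypothetical HTTP helpers):

// Approximate shapes only; the exact OlmMachine method signatures differ between versions.
async fn send_encrypted_message(
    machine: &OlmMachine,
    room_id: &RoomId,
    users: &[&UserId],
    content: serde_json::Value,
) -> anyhow::Result<()> {
    // Step 1: claim one-time keys for devices we don't share an Olm session with yet.
    if let Some((request_id, otk_request)) = machine.get_missing_sessions(users.iter().copied()).await? {
        let response = send_to_homeserver(otk_request).await?;
        machine.mark_request_as_sent(&request_id, &response).await?;
    }

    // Step 2: encrypt the room key for all participating devices and send it
    // out via to-device messages.
    for request in machine.share_room_key(room_id, users.iter().copied(), Default::default()).await? {
        send_to_homeserver(request).await?;
    }

    // Step 3: encrypt the actual event and send it into the room.
    let encrypted = machine.encrypt_room_event(room_id, "m.room.message", content).await?;
    send_room_event(room_id, encrypted).await?;

    Ok(())
}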

Steps 1 and 2 can later be skipped if all the devices have already received the room key. Sadly, step 1 can fail to establish an Olm session if a particular device has been depleted of its one-time keys. This is fixed by the introduction of fallback keys, but not every device uploads a fallback key yet.

This means that step 1 will be retried for every message we send out if a room contains such a depleted device. In test scenarios this isn't terribly slow, but the request to claim one-time keys can take a while, which manifests as slowness when sending room messages.

The server doesn't notify us when a device uploads a fresh set of one-time keys, so we'll have to introduce a timeout instead.

We should remember when we claimed keys for a user/device pair and not try again until a certain timeout has passed. All of this can stay in memory, since we don't expect a large number of dead devices, and retrying when we restart the client sounds sensible as well.
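
A minimal sketch of that bookkeeping, with illustrative names and an arbitrary timeout value (this is not the actual crate code):

use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Illustrative retry interval; the real value would need tuning.
const RETRY_TIMEOUT: Duration = Duration::from_secs(15 * 60);

/// Remembers, purely in memory, when we last tried to claim one-time keys
/// for a given user/device pair.
#[derive(Default)]
struct FailedOtkClaims {
    last_attempt: HashMap<(String, String), Instant>,
}

impl FailedOtkClaims {
    /// Record that a claim for this user/device pair just failed.
    fn mark_failed(&mut self, user_id: &str, device_id: &str) {
        self.last_attempt
            .insert((user_id.to_owned(), device_id.to_owned()), Instant::now());
    }

    /// Should this pair be included in the next /keys/claim request?
    fn should_retry(&self, user_id: &str, device_id: &str) -> bool {
        match self.last_attempt.get(&(user_id.to_owned(), device_id.to_owned())) {
            Some(failed_at) => failed_at.elapsed() >= RETRY_TIMEOUT,
            None => true,
        }
    }
}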

@dkasak
Member

dkasak commented Oct 20, 2021

Alternative idea to timeout: use an exponential backoff algorithm for retrying. If fetching OTKs fails on message N, retry on messages N+1, N+2, N+4, N+8, ... The factor should be fine-tuned, of course.
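
A sketch of that retry schedule (the cap on the exponent is arbitrary and purely illustrative):

// Given the message number N at which the claim originally failed and how
// many retries have failed since, compute the message number of the next
// retry: N+1, N+2, N+4, N+8, ...
fn next_retry_message(failed_at: u64, failed_retries: u32) -> u64 {
    failed_at + (1u64 << failed_retries.min(32))
}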

@dkasak
Member

dkasak commented Oct 20, 2021

Also, if it fails on message N but we only manage to fetch OTKs and establish an Olm channel on message N+M, we'll only send the room key in the N+M ratcheted state. We should instead send it in state N.
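
A hedged sketch of the extra bookkeeping this would need (the struct and field names are hypothetical, not OlmMachine internals): remember the Megolm chain index each failed device is owed, and share the session at that index once an Olm channel finally exists.

use std::collections::HashMap;

// Hypothetical bookkeeping: when sharing the room key with a device fails on
// message N, remember the chain index that device is owed, so that once an
// Olm session exists the key is shared at that index (state N) rather than
// at the current, further-ratcheted index.
#[derive(Default)]
struct PendingRoomKeyShares {
    // (user_id, device_id) -> Megolm chain index owed to that device.
    owed_index: HashMap<(String, String), u32>,
}

impl PendingRoomKeyShares {
    fn record_failure(&mut self, user_id: &str, device_id: &str, current_index: u32) {
        // Keep the earliest index if sharing fails repeatedly.
        self.owed_index
            .entry((user_id.to_owned(), device_id.to_owned()))
            .or_insert(current_index);
    }

    fn index_owed_to(&self, user_id: &str, device_id: &str) -> Option<u32> {
        self.owed_index
            .get(&(user_id.to_owned(), device_id.to_owned()))
            .copied()
    }
}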

@dkasak dkasak changed the title The crypto layer tries to establish Olm sessions too often OTK fetching failure woes (attempting to establish Olm sessions too often, sending the wrong ratchet state) Oct 20, 2021
@poljar poljar added this to the Stabilize the crypto crate milestone Nov 9, 2021
@poljar
Contributor Author

poljar commented Dec 22, 2022

Alternative idea to timeout: use an exponential backoff algorithm for retrying. If fetching OTKs fails on message N, retry on messages N+1, N+2, N+4, N+8, ... The factor should be fine-tuned, of course.

One flaw with this approach is that the app might be suspended for quite a while, so a significant amount of time would pass. The timeout approach would notice this and retry sooner, while an exponential backoff counting message numbers would not.

I agree that it would be neat to make this completely deterministic based on the message number, but I don't think that it would produce the desired behavior.

@ara4n
Member

ara4n commented Jan 2, 2023

See also element-hq/element-web#24138 for how not to do it in EW.

@poljar
Contributor Author

poljar commented Jan 2, 2023

What I implemented, albeit not yet pushed, is an exponential backoff with a maximum timeout of 15 minutes. That handles the failures field of both the /keys/query and the /keys/claim endpoint. The failures field gets populated with homeservers that could not be reached.
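
Roughly what that shape of backoff looks like, with made-up numbers and names (not the code that was actually pushed):

use std::collections::HashMap;
use std::time::{Duration, Instant};

// Per-homeserver backoff for servers that show up in the `failures` field of
// /keys/query and /keys/claim responses. The interval doubles per failure and
// is capped at 15 minutes; the base interval is arbitrary.
#[derive(Default)]
struct FailureBackoff {
    // server name -> (consecutive failure count, time of the last failure)
    failures: HashMap<String, (u32, Instant)>,
}

impl FailureBackoff {
    fn record_failure(&mut self, server: &str) {
        let entry = self.failures.entry(server.to_owned()).or_insert((0, Instant::now()));
        entry.0 += 1;
        entry.1 = Instant::now();
    }

    fn record_success(&mut self, server: &str) {
        self.failures.remove(server);
    }

    fn should_contact(&self, server: &str) -> bool {
        match self.failures.get(server) {
            Some((count, last_failure)) => {
                let delay = Duration::from_secs(5)
                    .saturating_mul(1u32 << (*count).min(8))
                    .min(Duration::from_secs(15 * 60));
                last_failure.elapsed() >= delay
            }
            None => true,
        }
    }
}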

The other failure mode that still needs to be handled before this issue can be closed is OTK exhaustion.
Sadly, the server doesn't seem to explicitly tell you that the device is gone or that it has no OTKs left, so we can't just reuse the above-mentioned solution.

But it's better to retry too often than to do what EW does, especially in this age of fallback keys, where OTK exhaustion is a relic of history.

@poljar
Contributor Author

poljar commented May 26, 2023

Re-opening since the OTK-exhaustion case is still posing problems and wasn't handled as part of #1315.

Since the server won't tell us which user/device pairs don't have an OTK, we'll have to remember the users for which we sent out a /keys/claim request.

I think that one way to handle this nicely API-wise is to use the following pattern:

struct MissingSessions {
    users: BTreeMap<OwnedUserId, Vec<OwnedDeviceId>>,
    machine: OlmMachine,
}

impl MissingSessions {
    /// Build the /keys/claim request for the missing sessions.
    pub fn request(&self) -> KeysClaimRequest {
        ...
    }

    /// Feed the /keys/claim response back into the machine.
    pub fn mark_request_as_sent(&self, response: &KeysClaimResponse) -> Result<(), CryptoStoreError> {
        self.machine.receive_keys_claim_response(&self.users, response)
    }
}

impl OlmMachine {
    pub fn get_missing_sessions(
        &self,
        users: impl IntoIterator<Item = &UserId>,
    ) -> Result<MissingSessions, CryptoStoreError> {
        ...
    }
}
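
The call site would then look roughly like this (still just a sketch of the proposed pattern, with a hypothetical http_client):

// Hypothetical usage of the pattern above.
let missing = machine.get_missing_sessions(users)?;
let request = missing.request();
let response = http_client.send(request).await?; // however the client talks to the homeserver
missing.mark_request_as_sent(&response)?;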

@poljar poljar reopened this May 26, 2023
@BillCarsonFr
Member

BillCarsonFr commented May 26, 2023

For context, here is a rageshake from a user with a device that has no OTKs (including no fallback key).
So every message they send will trigger another /keys/claim.
The /keys/claim call can take 50ms, 80ms, 100ms, 200ms, and there was even one occurrence at 3793ms (claimMissingKeys took: 3793 ms).

The problematic device is some EAX build.
The /keys/claim reply is:

{
    "one_time_keys": {},
    "failures": {}
}
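
In other words, a reply where both one_time_keys and failures are empty is currently the only hint that the device is exhausted (or gone). A hedged sketch of how a client could turn that into a back-off decision (all names and types are simplified for illustration):

use std::collections::HashSet;

// Illustrative only: given the devices we asked /keys/claim about, the pairs
// that actually came back in `one_time_keys`, and the servers listed in
// `failures`, work out which devices should be treated as OTK-exhausted and
// backed off, instead of being re-claimed on every message.
fn devices_to_back_off(
    requested: &[(String, String)],       // (user_id, device_id) pairs in the request
    returned: &HashSet<(String, String)>, // pairs present in `one_time_keys`
    failed_servers: &HashSet<String>,     // keys of the `failures` map
) -> Vec<(String, String)> {
    requested
        .iter()
        .filter(|(user, _)| {
            // Naive server-name extraction, good enough for a sketch.
            let server = user.rsplit(':').next().unwrap_or_default();
            !failed_servers.contains(server)
        })
        .filter(|pair| !returned.contains(*pair))
        .cloned()
        .collect()
}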

@richvdh
Member

richvdh commented Oct 16, 2023

@poljar:

sending the wrong ratchet state

This sounds very bad, and I don't see any explanation of it in the issue. Can you clarify?

@dkasak
Member

dkasak commented Oct 16, 2023

@richvdh: It's referring to the contents of this comment.

@richvdh
Member

richvdh commented Oct 16, 2023

Ah, the wrong Megolm ratchet. I thought it meant the Olm ratchet!

@dkasak dkasak changed the title OTK fetching failure woes (attempting to establish Olm sessions too often, sending the wrong ratchet state) OTK fetching failure woes (attempting to establish Olm sessions too often, sending the wrong Megolm ratchet state) Oct 16, 2023
@richvdh
Member

richvdh commented Oct 26, 2023

Out of interest I looked at the anatomy of a slow /keys/claim request. My thoughts over at matrix-org/synapse#16554. TL;DR: synapse could do this faster. Still, that's not really the point for this issue.

@richvdh
Member

richvdh commented Oct 26, 2023

I am tempted to write an MSC to modify /keys/claim to say that it must return an empty dict for any devices which were unknown or which have run out of OTKs. TBD what to do about federation: can we trust remote servers to behave themselves in this respect?

Will think on it a little more.

@richvdh
Member

richvdh commented Nov 1, 2023

So I opened an MSC (matrix-org/matrix-spec-proposals#4072), but I think it's actually pointless, and we might as well fix this on the client side.

@dkasak
Member

dkasak commented Nov 20, 2023

Can we please avoid closing partially resolved issues without splitting off the remainder into a separate issue? I'm talking about our inability to correctly heal from a transient network failure, because we'll send the incorrect (advanced) Megolm ratchet state.

I realise the original issue may have been a bit confusing due to bundling these concerns, but we shouldn't be losing information like this.

@dkasak dkasak changed the title OTK fetching failure woes (attempting to establish Olm sessions too often, sending the wrong Megolm ratchet state) OTK fetching failure woes (attempting to establish Olm sessions too often) Nov 20, 2023
@richvdh
Member

richvdh commented Nov 20, 2023

Can we please avoid closing partially resolved issues without splitting off the remainder into a separate issue? I'm talking about our inability to correctly heal from a transient network failure, because we'll send the incorrect (advanced) Megolm ratchet state.

Sorry, I didn't realise there was any outstanding work left on this issue.
