[WIP] MSC3814: Dehydrated devices with SSSS #3814

224 changes: 224 additions & 0 deletions proposals/3814-dehydrated-devices-with-ssss.md
@@ -0,0 +1,224 @@
# MSC3814: Dehydrated Devices with SSSS

[MSC2697](https://github.com/matrix-org/matrix-doc/pull/2697) introduces device
dehydration -- a method for creating a device that can be stored in a user's
account and receive megolm sessions. In this way, if a user has no other
devices logged in, they can rehydrate the device on the next login and retrieve
the megolm sessions.

However, the approach presented in that MSC has some downsides, making it
tricky to implement in some clients, and presenting some UX difficulties. For
example, it requires that the device rehydration be done before any other API
calls are made (in particular `/sync`), which may conflict with clients that
currently assume that `/sync` can be called immediately after logging in.

In addition, the user is required to enter a key or passphrase to create a
dehydrated device. In practice, this is usually the same as the SSSS
key/passphrase, which means that the user loses the advantage of verifying
their other devices via emoji or QR code: either they will still be required to
enter their SSSS key/passphrase (or a separate one for device dehydration), or
else that client will not be able to dehydrate a device.

This proposal introduces another way to use the dehydrated device that solves
these problems by storing the dehydration key in SSSS, and by not changing the
client's device ID. Rather than changing its device ID when it rehydrates the
device, it will keep its device ID and upload its own device keys. The client
will separately rehydrate the device, fetch its to-device messages, and decrypt
them to retrieve the megolm sessions.

## Proposal

### Dehydrating a device

The dehydration process is the same as in MSC2697. For completeness, it is
repeated here:

To upload a new dehydrated device, a client will use `PUT /dehydrated_device`.
Each user has at most one dehydrated device; uploading a new dehydrated device
will remove any previously-set dehydrated device.

`PUT /dehydrated_device`

```jsonc
{
"device_data": {
"algorithm": "m.dehydration.v1.olm",
"other_fields": "other_values"
},
"initial_device_display_name": "foo bar" // optional
}
```

Result:

```json
{
"device_id": "dehydrated device's ID"
}
```

After the dehydrated device is uploaded, the client will upload the encryption
keys using `POST /keys/upload/{device_id}`, where the `device_id` parameter is
the device ID given in the response to `PUT /dehydrated_device`. The request
and response formats for `POST /keys/upload/{device_id}` are the same as those
for `POST /keys/upload` with the exception of the addition of the `device_id`
path parameter.

Note: Synapse already supports `POST /keys/upload/{device_id}`, as it was used
by some old clients. However, Synapse requires that the given device ID match
the device ID of the client making the call, so this will be changed to allow
uploading keys for the dehydrated device.
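
As a rough illustration of the two-step upload described above, the following sketch only builds the request descriptions and does not perform any network calls. The function names and the `(method, path, body)` tuple shape are hypothetical helpers invented for this example; only the paths and field names come from the text.

```python
from urllib.parse import quote


def build_dehydrated_device_request(device_data, display_name=None):
    """Return (method, path, body) for uploading a dehydrated device."""
    body = {"device_data": device_data}
    if display_name is not None:
        # The display name is optional in the request body.
        body["initial_device_display_name"] = display_name
    return ("PUT", "/dehydrated_device", body)


def build_key_upload_request(put_response, keys_body):
    """Return (method, path, body) for uploading the dehydrated device's keys.

    `put_response` is the parsed JSON returned by PUT /dehydrated_device;
    `keys_body` uses the same format as the body of POST /keys/upload.
    """
    device_id = put_response["device_id"]
    # Percent-encode the ID so it is safe to place in a URL path segment.
    path = "/keys/upload/" + quote(device_id, safe="")
    return ("POST", path, keys_body)
```

A client would send the first request, read `device_id` from the response, and only then build and send the second request.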

### Rehydrating a device

To rehydrate a device, a client first calls `GET /dehydrated_device` to see if
a dehydrated device is available. If a device is available, the server will
respond with the dehydrated device's device ID and the dehydrated device data.

`GET /dehydrated_device`

Response:

```json
{
"device_id": "dehydrated device's ID",
"device_data": {
"algorithm": "m.dehydration.v1.olm",
"other_fields": "other_values"
}
}
```

If no dehydrated device is available, the server responds with HTTP status 404
and the error code `M_NOT_FOUND`.
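
A client might interpret the two outcomes of `GET /dehydrated_device` roughly as follows. This is a sketch, not part of the proposal: the function name is hypothetical, the success status is assumed to be 200 (the text does not state it), and `errcode` is the standard Matrix error field.

```python
def parse_dehydrated_device(status, body):
    """Return (device_id, device_data), or None if no device is available.

    `status` is the HTTP status code and `body` the parsed JSON response.
    """
    if status == 404 and body.get("errcode") == "M_NOT_FOUND":
        # No dehydrated device is stored for this user.
        return None
    if status != 200:
        # Assumption: any other status is treated as an unexpected failure.
        raise RuntimeError(f"unexpected response: {status} {body.get('errcode')}")
    return body["device_id"], body["device_data"]
```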

If the client is able to decrypt the data and wants to use the dehydrated
Member Author

I wonder if we can say something like: the server is allowed to discard any non-m.room.encrypted to-device message that it receives for the dehydrated device. There's no point in keeping key requests sent to the dehydrated device because it won't send anything back.

device, the client retrieves the to-device messages sent to the dehydrated
device by calling `POST /dehydrated_device/{device_id}/events`, where
Member

Why include a device_id if you can only have a single dehydrated device? It has implications that you can provide more than one (and causes additional error checking of whether the provided device ID matches the dehydrated device ID).

Contributor

I agree that the device ID is redundant. Though since this MSC has been written a new use-case for this endpoint has been found.

In the sliding sync world we have split out the fetching of to-device events into a separate sync loop. Namely one of the biggest problems of the existing /sync mechanism is that you get too much data all at once and the downloading and processing of that data prevents the UI from being updated.

To-device events are one of those things that are not directly related to the things that a client will want to display in a room or room list, so putting it into a separate sync loop allows the main loop to quickly send updates while to-device moves along in the background. More info here: matrix-org/matrix-rust-sdk#1928

I think that old sync could handle such a split as well, so I would suggest here to rename the endpoint to become /sync/to_device/{device_id} where device_id might be optional and used only in the case of a dehydrated device.

Member Author

The reason for the device_id parameter is that, while one client is fetching events, another client could create a new dehydrated device. Without the device_id parameter, the server could think that the client wants to fetch the events for the new device which, if there are any, it won't be able to decrypt since it's for a device that it doesn't know about. With the device_id parameter, the server will at least be able to say that there are no more events (since the device has been replaced by a new one).

Member

This sounds quite racy to me -- how does the server know that one dehydrated device is claimed? How would the client know to make a new one instead of claim the old one?

Member Author

how does the server know that one dehydrated device is claimed?

It's OK for multiple clients to rehydrate the same device (unlike in the previous proposal), because it never becomes a real device. So the server can just wait until some client fetches all the events before dropping the device.

How would the client know to make a new one instead of claim the old one?

Making a new device and rehydrating an old one are two different use cases. Rehydration happens after you log in, and you're setting up encryption and trying to get keys. It only happens once in the device's lifetime. Creating a new dehydrated device would happen after you've already set up your encryption and already attempted to rehydrate a device.

`{device_id}` is the ID of the dehydrated device. Since there may be many
messages, the response can be sent in batches: the response can include a
`next_batch` parameter, which can be used in a subsequent call to `POST
/dehydrated_device/{device_id}/events` to obtain the next batch.

```
POST /dehydrated_device/{device_id}/events

Why is this a POST and not a GET like /sync and /messages?

Member Author

IIRC, the rationale was because the call has side-effects (deleting the device).

It's a bit weird that it doesn't follow the pattern of /messages, /events or /sync imo. I'll try implementing it as a GET without the device deletion first and see how that works out, I think.

Member

A GET endpoint with side-effects seems like a big no-no to me. Everyone expects a GET request to have approximately zero side-effects.

Member

oh, but we're also proposing removing the side-effects? SGTM in that case

Member Author

Yup, the current implementation no longer automatically deletes the device on the server side, but relies on the client to delete/create a new device. So we're going to try to make this a GET.

{
"next_batch": "token from previous call" // (optional)
}
```

Response:

```jsonc
{
"events": [
// array of to-device messages, in the same format as in
// https://spec.matrix.org/unstable/client-server-api/#extensions-to-sync
],
"next_batch": "token to obtain next events" // optional
}
```

Once a client calls `POST /dehydrated_device/{device_id}/events`, the server
can delete the device (though not necessarily its to-device messages). Once a

Why should the server delete the device? Shouldn't this rather be done by the client explicitly in a delete call?

Imo it is not quite obvious that fetching the events should "break" the device. A client might fail to properly restore and now you lost all the intermediate sessions. instead the client should replace the device once it is somewhat sure it restored successfully and has uploaded the megolm keys to online backup.

Member Author

The idea is that when the client starts getting events, it means that the client is signalling its intention to use the dehydrated device, and it has been "claimed", so it shouldn't be used by anyone else. At this point, there isn't much that can be done if the client, e.g. fails to decrypt some messages. If it fails to decrypt messages with the dehydrated device, it's unlikely that leaving the device around will fix anything in the future -- any future attempts would likely fail as well. So the best thing to do is to replace the dehydrated device with a new one anyways.

I'm not insistent on this endpoint deleting the dehydrated device, but I think that once you start using a dehydrated device, you'll want to create a new device no matter what.

It's more that if the device fetches the first few events but then the user closes the browser and it never gets to upload the devices, then you have effectively thrown the dried fish out the window without properly getting to use it. So in that case it should either only delete the device, when it deletes the first few messages (i.e. by the client paginating with a next token), or just wait for the user to send a new device. Since you CAN still use the same dried device from another device, I think. All of the messages will be PRE KEY messages, so you can decrypt them as long as you haven't deleted the one time keys from the pickled device. So even if a client downloads the first batch of messages and then starts with the next batch and the first batch gets deleted, a different client should still be able to pick up from there.

I agree that you want to create a new device no matter what, but that can just be done by uploading a new one instead of implicitly doing it when receiving messages.

Member Author

Yeah, the issue of a client starting to load events and then dying somehow is possible, but seems like it would be extremely rare.

I think another consideration is that a client could forget to replace the dehydrated device. If the device gets deleted automatically, then it makes it obvious that the client didn't do that.

In any event, I think it's fine to try it out with GET and without automatically deleting the device, and see how it goes.

Contributor

There's also an issue if there's a connection problem during the POST request; the client presumably won't be able to re-try the request because the device will have been deleted.
We generally try to design our APIs so that they work / can be retried even if the request fails half-way through for some reason.

Member Author

Actually, the idea is that the dehydrated device is "deleted" in the sense that no other client can claim it, and if a client queries for the dehydrated device, it won't be returned. But the events associated with it are still there and can be retrieved (until the events get deleted as described elsewhere in the MSC), so if the client re-tries the POST request, it will still get the events.

Yes, the problem is just that if the client fails to replace the device, then a failed login attempt will break the device dehydration until the next successful login, since there is no way to receive messages in the meantime (while that would work fine if the device is just kept).

I actually ran into another race condition here in production. We currently have it implemented that the PUT of a new device removes the old device. However, since uploading a new device takes several requests (claim new device, upload keys, sign it, upload encrypted device), we run into a race condition, where the user closes the browser window during one of the steps and maybe only signs back in later. That means we have an unhydrateable device and we again lose messages over the gap. Ideally there would be some way to make this atomic to prevent this race condition.

Contributor

I agree that having this be a PUT is the wrong pattern here. If the whole idea here is for a Client to fetch to-device events and room keys then we should design the flows in a way that minimizes the risk of loss of such room keys.

As it stands the flows can't be resumed once you start fetching the to-device events, and the fetching of the to-device events will be by far the longest operation here.

What would allow perfect resumption is:

  1. The dehydrated device only gets deleted if the client requests so.
  2. The to-device events only get deleted¹ when the dehydrated device gets deleted, this is opposed to the current mechanism, where to-device events get deleted once the server sees a next_batch from a previous request.

No 2. would ensure that, even if a device that attempts a rehydration gets stopped and deleted mid-rehydration, another, new device can restart the rehydration process.

I actually ran into another race condition here in production. We currently have it implemented that the PUT of a new device removes the old device. However, since uploading a new device takes several requests (claim new device, upload keys, sign it, upload encrypted device),

Agree here as well, PUT of a new device should happen in a single request which should upload the dehydrated device, its device keys and any one-time keys. I implemented a draft version of this behavior in this patch: matrix-org/synapse@777b305

client calls `POST /dehydrated_device/{device_id}/events` with a `next_batch`
token, the server can delete any to-device messages delivered in previous
batches. It is recommended that, for the last batch of messages, the server
still send a `next_batch` token, and return an empty `events` array when called
with that token, so that it knows that the client has successfully received all
the messages.
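
The batching behaviour described in the proposal text can be sketched from the client's side as a simple loop. The `post_events` callable is a hypothetical stand-in for whatever performs `POST /dehydrated_device/{device_id}/events` and returns the parsed JSON; it is not defined by the proposal. The loop assumes the server follows the recommendation of ending with an empty `events` array.

```python
def fetch_all_events(post_events):
    """Collect every to-device event queued for a dehydrated device.

    `post_events(next_batch)` is a hypothetical callable that sends the
    request (with `next_batch` omitted when it is None) and returns the
    parsed JSON response.
    """
    events = []
    token = None
    while True:
        response = post_events(token)
        batch = response.get("events", [])
        events.extend(batch)
        token = response.get("next_batch")
        # Stop when the server has nothing more to offer: either it sent no
        # token, or it returned an empty batch. Reaching the empty batch also
        # tells the server that the previous batch was received.
        if token is None or not batch:
            return events
```

Because each call with a `next_batch` token lets the server drop earlier batches, a client following this sketch should persist or process each batch before requesting the next one.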

### Device Dehydration Format

TODO: define a format. Unlike MSC2697, we don't need to worry about the
dehydrated device being used as a normal device, so we can omit some
information. So we should be able to get by with defining a fairly simple
standard format, probably just the concatenation of the private device keys and
the private one-time keys. This will come at the expense of implementations
such as libolm needing to implement extra functions to support dehydration, but
will have the advantage that we don't need to figure out a format that will fit
into every possible implementation's idiosyncrasies. The format will be
encrypted, which leads to ...

#### Encryption key

The encryption key used for the dehydrated device will be randomly generated
and stored/shared via SSSS using the name `m.dehydrated_device`.
Member Author

if I'm reading the iOS implementation correctly, the key is encoded with unpadded base64 (as is done with the other keys in secret storage)
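
For illustration, generating the random key and encoding it for storage could look like the sketch below. The 32-byte key length is an assumption (the proposal does not specify a size), and the unpadded base64 encoding follows the review remark above rather than the proposal text. Storing the result in SSSS under `m.dehydrated_device` is left out.

```python
import base64
import secrets

# Assumption: the proposal does not fix a key size; 32 bytes is used here.
KEY_LENGTH = 32


def generate_dehydration_key():
    """Return a fresh random key from a cryptographically secure source."""
    return secrets.token_bytes(KEY_LENGTH)


def encode_unpadded_base64(data):
    """Encode bytes as standard base64 with the trailing '=' padding removed."""
    return base64.b64encode(data).decode("ascii").rstrip("=")


def decode_unpadded_base64(text):
    """Restore the padding that was stripped, then decode."""
    padding = "=" * (-len(text) % 4)
    return base64.b64decode(text + padding)
```

The encoded string is what a client would place in secret storage; decoding it on another device yields the same key bytes.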


## Potential issues

The same issues as in
[MSC2697](https://github.com/matrix-org/matrix-doc/pull/2697) are present for
this proposal. For completeness, they are repeated here:

### One-time key exhaustion

The dehydrated device may run out of one-time keys, since it is not backed by
an active client that can replenish them. Once a device has run out of
one-time keys, no new olm sessions can be established with it, which means that
devices that have not already shared megolm keys with the dehydrated device
will not be able to share megolm keys. This issue is not unique to dehydrated
devices; this also occurs when devices are offline for an extended period of
time.

This may be addressed by using fallback keys as described in
[MSC2732](https://github.com/matrix-org/matrix-doc/pull/2732).

To reduce the chances of one-time key exhaustion, if the user has an active
client, it can periodically replace the dehydrated device with a new dehydrated
device with new one-time keys. If a client does this, then it runs the risk of
losing any megolm keys that were sent to the dehydrated device, but the client
would likely have received those megolm keys itself.
Comment on lines +336 to +340
Member

Are we doing this [replacing the dehydrated device periodically] or not?

It seems like both have serious downsides. If we do replace it, we have a very racy operation that is certain to cause UTDs in practice. If we don't replace it, then we'll end up with no remaining OTKs at all, and an incredibly long list of to-device messages all of which have to be downloaded and decrypted by any new clients.


Alternatively, the client could perform a `/sync` for the dehydrated device,
Member

Does this still work with v2? Can we still sync on the dehydrated device?

You can't sync as a different device in this proposal. You can fetch the events for that device, but this proposal implicitly deletes the device in that case, which means you can't keep the device alive after that. So imo your only option is to replace it (which is somewhat easy to do, but you might need to authenticate the new signature upload/device?).

dehydrate the olm sessions, and upload new one-time keys. By doing this
instead of overwriting the dehydrated device, the device can receive megolm
keys from more devices. However, this would require additional server-side
changes above what this proposal provides, so this approach is not possible for
the moment.

### Accumulated to-device messages

If a dehydrated device is not rehydrated for a long time, then it may
accumulate many to-device messages from other clients sending it megolm
sessions. This may result in a slower initial sync when the device eventually
does get rehydrated, due to the number of messages that it will retrieve.
Again, this can be addressed by periodically replacing the dehydrated device,
or by performing a `/sync` for the dehydrated device and updating it.

## Alternatives

As mentioned above,
[MSC2697](https://github.com/matrix-org/matrix-doc/pull/2697) tries to solve
the same problem in a similar manner, but has several disadvantages that are
fixed in this proposal.

Rather than keep the name "dehydrated device", we could change the name to
something like "shrivelled sessions", so that the full expansion of this MSC
title would be "Shrivelled Sessions with Secure Secret Storage and Sharing", or
SSSSSS. However, despite the alliterative property, the term "shrivelled
sessions" is less pleasant, and "dehydrated device" is already commonly used to
refer to this feature.

The alternatives discussed in MSC2697 are also alternatives here.


## Security considerations

The security consideration in MSC2697 also applies to this proposal: If the
dehydrated device is encrypted using a weak password or key, an attacker could
access it and read the user's encrypted messages.

## Unstable prefix

While this MSC is in development, the `/dehydrated_device` endpoints will be
reached at `/unstable/org.matrix.msc3814.v1/dehydrated_device`, and the
`/dehydrated_device/{device_id}/events` endpoint will be reached at
`/unstable/org.matrix.msc3814.v1/dehydrated_device/{device_id}/events`. The
dehydration algorithm `m.dehydration.v1.olm` will be called
`org.matrix.msc3814.v1.olm`. The SSSS name for the dehydration key will be
`org.matrix.msc3814` instead of `m.dehydrated_device`.
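
One way a client could keep the stable and unstable names in one place is sketched below. The helper names and the `stable` flag are hypothetical; the strings themselves are taken from the paragraph above.

```python
UNSTABLE_PREFIX = "/unstable/org.matrix.msc3814.v1"

# Stable identifier -> identifier to use while the MSC is in development.
STABLE_TO_UNSTABLE = {
    "m.dehydration.v1.olm": "org.matrix.msc3814.v1.olm",
    "m.dehydrated_device": "org.matrix.msc3814",
}


def endpoint(path, stable=False):
    """Return the path to call for a `/dehydrated_device...` endpoint."""
    if not path.startswith("/dehydrated_device"):
        raise ValueError("not a dehydrated device endpoint: " + path)
    return path if stable else UNSTABLE_PREFIX + path


def identifier(name, stable=False):
    """Return the algorithm or SSSS name to use for a stable identifier."""
    return name if stable else STABLE_TO_UNSTABLE[name]
```

For example, `endpoint("/dehydrated_device/ABC/events")` yields the unstable events path for a device with the ID `ABC`.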

Client implementation: https://gitlab.com/famedly/company/frontend/famedlysdk/-/merge_requests/1111

Server implementation: matrix-org/synapse#13581

Both not merged yet and notably missing is the dehydrated device format.

## Dependencies

None