Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Olm sessions can wedge when using multiple Clients (e.g with NSE process) #3110

Closed
kegsay opened this issue Feb 8, 2024 · 23 comments · Fixed by #3338
Closed

Olm sessions can wedge when using multiple Clients (e.g with NSE process) #3110

kegsay opened this issue Feb 8, 2024 · 23 comments · Fixed by #3338
Labels
bug Something isn't working encryption

Comments

@kegsay
Copy link
Member

kegsay commented Feb 8, 2024

I've been looking into bug reports on Element X, and this appears on iOS: (snipped a lot of data)

App:
2024-02-06T07:23:56.293963Z DEBUG matrix_sdk_crypto::machine: Received a to-device event | crates/matrix-sdk-crypto/src/machine.rs:1117 | 
2024-02-06T07:23:56.296398Z DEBUG matrix_sdk_crypto::machine: Received a decrypted to-device event | crates/matrix-sdk-crypto/src/machine.rs:975 | 
2024-02-06T07:23:56.298524Z INFO matrix_sdk_crypto::machine: Received a new megolm room key | crates/matrix-sdk-crypto/src/machine.rs:755 | 
...
2024-02-06T07:23:56.300586Z DEBUG matrix_sdk_crypto::machine: Received a to-device event | crates/matrix-sdk-crypto/src/machine.rs:1117 | 
2024-02-06T07:23:56.301713Z DEBUG matrix_sdk_crypto::machine: Received a decrypted to-device event | crates/matrix-sdk-crypto/src/machine.rs:975 | 
2024-02-06T07:23:56.302390Z INFO matrix_sdk_crypto::machine: Received a new megolm room key | crates/matrix-sdk-crypto/src/machine.rs:755 | 

NSE:
2024-02-06T07:46:24.132801Z DEBUG matrix_sdk_crypto::machine: Received a to-device event | crates/matrix-sdk-crypto/src/machine.rs:1117 | 
2024-02-06T07:46:24.133377Z DEBUG matrix_sdk_crypto::olm::account: Creating a new Olm session from a pre-key message | crates/matrix-sdk-crypto/src/olm/account.rs:953 | 
2024-02-06T07:46:24.133655Z WARN matrix_sdk_crypto::olm::account: Failed to create a new Olm session from a pre-key message: InboundCreation(MissingOneTimeKey("curve25519:3ZuWFzabqiB1bYH1iE/dys5sCxSh8/8KKgo1iNZxsE8")) | crates/matrix-sdk-crypto/src/olm/account.rs:1220 | 
2024-02-06T07:46:24.135368Z INFO matrix_sdk_crypto::session_manager::sessions: Marking session to be unwedged sender_key="curve25519:KWr4ACcEDQ/+VVP7/gHSNGDv1hu9NA4GZZqVckzAgEg" | crates/matrix-sdk-crypto/src/session_manager/sessions.rs:116 | 

The important thing here is that the session was merrily decrypting to-device events fine, until the NSE process took over. At that point, it failed to find the session, and then failed to decrypt the event (due to the sender chain index being >0 I guess) which then marked the Olm session as wedged.

I believe the problem lies in decrypt_olm_message because the NSE process fails to find the session in existing_sessions, implying a cache miss when it should have hit.

@kegsay kegsay added bug Something isn't working encryption labels Feb 8, 2024
@bnjbvr
Copy link
Member

bnjbvr commented Feb 8, 2024

Thanks for opening an issue. Indeed it looks like the session cache in the NSE isn't being reloaded, for some reason, and addressing #2624 might help with this.

@bnjbvr
Copy link
Member

bnjbvr commented Feb 8, 2024

@kegsay do you see this line in the Rust logs, out of curiosity?

error when extending lock lease:

@kegsay
Copy link
Member Author

kegsay commented Feb 9, 2024

No I do not see this log line. I did trace the "crypto store mismatch" stuff to see if it was ever wrong, and based on my reading of it, it was always very sane and reasonable.

@richvdh
Copy link
Member

richvdh commented Feb 22, 2024

Presumably this only affects Element X iOS, because that is the only platform that uses two processes?

@kegsay
Copy link
Member Author

kegsay commented Feb 22, 2024

Yes. Web with multiple tabs would also be similarly affected.

@kegsay
Copy link
Member Author

kegsay commented Feb 29, 2024

I don't understand why we can't just disable the in-memory cache here. This would mean we exclusively do database hits which would work (sqlite supports multiprocess just fine: https://www.sqlite.org/faq.html#q5 ) If there are performance concerns, can we just profile it and see if that's indeed the case?

@bnjbvr
Copy link
Member

bnjbvr commented Feb 29, 2024

For what it's worth, it's not only about having atomic transactions and enabling the WAL in sqlite. A few reasons off the top of my head:

  1. we need to have some support for doing concurrent writes in indexeddb, and indexeddb doesn't have an equivalent to WAL, as far as I know.
  2. even if we had purely atomic transactions, now we'd need to have wide application-level transactions with rollback. For instance, you receive an encrypted message; decrypting it will move the associated ratchet forward in the Account, which you'll want to read from the DB and store back into the DB after it's moved forward. So you're running an UPDATE Account(...) IF { state is what you expected before, aka it hasn't changed in the database }. If this transaction fails (because the state isn't what you expected, aka another process touched it), you now need to roll back to the previous Account state (and all associated state), and replay all the actions that lead up to the sqlite update.
  3. "just disabling the memory cache" doesn't really make sense: it's not a single cache, it's a set of fields in structs scattered across the code base. There was a project to try to reunify those into an actual single cache structure, but this project has been considered low priority by Element's management and thus dropped.
  4. The in-memory caches have been added exactly because of performance reasons, so we'd revert back to a state where we have performance battles we've already fought and won. We might need more benchmarks to prove if a cache is actually required for performance or not.

@kegsay
Copy link
Member Author

kegsay commented Feb 29, 2024

we need to have some support for doing concurrent writes in indexeddb, and indexeddb doesn't have an equivalent to WAL, as far as I know.

Both firefox and chrome appear to be able to handle locking cross-tabs at the very least when I did some testing to try to trip it up with read-modify-writes, which may be enough?

@richvdh
Copy link
Member

richvdh commented Mar 1, 2024

we need to have some support for doing concurrent writes in indexeddb, and indexeddb doesn't have an equivalent to WAL, as far as I know.

Both firefox and chrome appear to be able to handle locking cross-tabs at the very least when I did some testing to try to trip it up with read-modify-writes, which may be enough?

As Kegan has said, indexeddb is reasonably resilient to multiple "processes" attempting to write to it at once, although its transaction semantics are annoying and interact poorly with asynchronous code.

However, I don't think we should concern ourselves too much with indexeddb here; multiprocess behaviour on web is not currently an issue due to element-hq/element-web#25157, and right now I think it's more likely that we'll solve multi-tab operation by having a single worker process which is shared between tabs at the JS level than by trying to make concurrent matrix-sdk-crypto processes truly robust. (Note that the NSE process is an easier problem to solve than the web process, because the NSE process does less stuff. For example, it will never send encrypted messages, which means a whole class of failure modes is ruled out.)

In other words: if indexeddb support were the only thing worrying us here, I wouldn't count that as a good argument.

@bnjbvr
Copy link
Member

bnjbvr commented Mar 1, 2024

(Note that the NSE process is an easier problem to solve than the web process, because the NSE process does less stuff. For example, it will never send encrypted messages, which means a whole class of failure modes is ruled out.)

iOS Share Extension enters the room

(This is another process that will be used to send encrypted messages, for what it's worth.)

@richvdh
Copy link
Member

richvdh commented Mar 1, 2024

ugh.

@ara4n
Copy link
Member

ara4n commented Mar 2, 2024

also, i think NSE /can/ send encrypted messages (via quick reply functionality) - i.e. replying from lockscreen etc.

@ara4n
Copy link
Member

ara4n commented Mar 3, 2024

(also, if rust-sdk /does/ have to support concurrent access to the same underlying store - e.g. for NSE, it feels really weird not to avail ourselves of that as a feature on EWR, imo, rather than coming up with an entirely different service-worker based architecture that solves the same underlying problem)

@kegsay
Copy link
Member Author

kegsay commented Mar 4, 2024

NSE can send messages in EI today, though accidentally, which is what element-hq/element-ios#7751 is talking about - specifically key sharing requests.

@BillCarsonFr
Copy link
Member

Context Links:
IOS/EIR: share and quick reply are disabled element-hq/element-ios#7618
IOS/EIR: Background machine matrix-org/matrix-ios-sdk#1704

@richvdh
Copy link
Member

richvdh commented Mar 4, 2024

also, i think NSE /can/ send encrypted messages (via quick reply functionality) - i.e. replying from lockscreen etc.

For the record, I believe this is currently incorrect: the share extension and quick reply are currently disabled. Though it doesn't entirely change the point here: we need a "proper" fix to that eventually.

(also, if rust-sdk /does/ have to support concurrent access to the same underlying store - e.g. for NSE, it feels really weird not to avail ourselves of that as a feature on EWR, imo, rather than coming up with an entirely different service-worker based architecture that solves the same underlying problem)

hrmrmrmrm you may be right. Web locks make implementing the concurrency in the Rust less horrifying than it used to be.

That said: I don't think we've really figured out the semantics for how different processes can call /sync (and then OlmMachine.receive_sync_changes) and OlmMachine.outgoing_requests without fighting with each other. For example: a particular failure mode might be for both tabs to upload the same batch of OTKs: if one of the keys was claimed in the meantime, we could end up with UISIs. (Though, maybe we have the same problem with single tabs, since we do retries on /keys/upload requests.) We solve this in iOS by having the extensions not generate OTKs, which doesn't really work in web because all processes are equal; if you decide to elect a "master" then you're basically back to web workers.

Anyway: my real point here is that even if we massively improve the concurrency support in the crypto-sdk to the point where the iOS extension processes function correctly (which is a lot of work in itself), that doesn't necessarily get us to the full level of concurrency that web would need.

@BillCarsonFr
Copy link
Member

Related NSE problem, causing re-use of megolm index (detected as replay attack) element-hq/element-ios#7499

@kegsay
Copy link
Member Author

kegsay commented Mar 5, 2024

Anyway: my real point here is that even if we massively improve the concurrency support in the crypto-sdk to the point where the iOS extension processes function correctly (which is a lot of work in itself), that doesn't necessarily get us to the full level of concurrency that web would need.

I agree, we may need to be more creative to support multi-tab web (e.g Web Locks) but I don't think this should be a blocker for progress on this issue.

even if we had purely atomic transactions, now we'd need to have wide application-level transactions with rollback. For instance, you receive an encrypted message; decrypting it will move the associated ratchet forward in the Account, which you'll want to read from the DB and store back into the DB after it's moved forward. So you're running an UPDATE Account(...) IF { state is what you expected before, aka it hasn't changed in the database }. If this transaction fails (because the state isn't what you expected, aka another process touched it), you now need to roll back to the previous Account state (and all associated state), and replay all the actions that lead up to the sqlite update.

I think @bnjbvr is saying much the same thing here, that if we want true multi-process then we need additional locks in place.

@richvdh
Copy link
Member

richvdh commented Mar 8, 2024

It seems like there are still a lot of unknowns here. For example, as well as the case Kegan mentioned in the OP, we've also seen cases where the NSE process appears to reload the OlmMachine from database, and still has a decryption failure. One theory is that outstanding tasks involving the OlmMachine may keep it alive after it's been destroyed, causing the "we have two OlmMachines" again, although exactly how that leads to the observed failure mode remains unclear.

Another theory is that maybe the current cross-process lock is not sound. We're not seeing much evidence of that in the logs though.

@richvdh
Copy link
Member

richvdh commented Apr 9, 2024

#3313 is likely a big cause of this.

@kegsay
Copy link
Member Author

kegsay commented Apr 18, 2024

@richvdh @andybalaam and myself have identified the root cause on this.

We re-use the SqliteCryptoStore when we regenetate_olm but the store contains a cache called session_cache which never gets updated/changed. I'm not sure how this has ever worked correctly. It seems in theory that the wedging will reliably happen if the following are true:

  • the nse process has started and loaded sessions
  • the app process does something to change the session (e.g sending a message after receiving one)
  • the nse process gets a 2nd message

This matches our minimal working example which does Bob receive -> Alice send -> Bob receive (this wedges the session, this assumes for each message there are new olm messages aka a rotation period of 1 msg). If we remove the cache the test passes.

We hypothesise it may also be the root cause of element-hq/element-ios#7751 and element-hq/element-ios#7480

@richvdh
Copy link
Member

richvdh commented Apr 18, 2024

@richvdh @andybalaam and myself have identified the root cause on this.

Correction: we have identified one root cause of this, which appears likely to cause a lot of the observed cases. If #3313 is real, it is likely another cause of these symptoms.

andybalaam added a commit that referenced this issue May 10, 2024
Adds a test for #3110
that fails before the fix and passes afterwards.
andybalaam added a commit that referenced this issue May 10, 2024
Adds a test for #3110
that fails before the fix and passes afterwards.
andybalaam added a commit that referenced this issue May 13, 2024
Adds a test for #3110
that fails before the fix and passes afterwards.
andybalaam added a commit that referenced this issue May 13, 2024
Adds a test for #3110
that fails before the fix and passes afterwards.
@richvdh
Copy link
Member

richvdh commented May 13, 2024

I believe the fix for this is finally released today, in Element X iOS 1.6.7.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working encryption
Projects
None yet
5 participants