Olm sessions can wedge when using multiple Clients (e.g with NSE process) #3110

kegsay · 2024-02-08T13:51:27Z

I've been looking into bug reports on Element X, and this appears on iOS: (snipped a lot of data)

App:
2024-02-06T07:23:56.293963Z DEBUG matrix_sdk_crypto::machine: Received a to-device event | crates/matrix-sdk-crypto/src/machine.rs:1117 | 
2024-02-06T07:23:56.296398Z DEBUG matrix_sdk_crypto::machine: Received a decrypted to-device event | crates/matrix-sdk-crypto/src/machine.rs:975 | 
2024-02-06T07:23:56.298524Z INFO matrix_sdk_crypto::machine: Received a new megolm room key | crates/matrix-sdk-crypto/src/machine.rs:755 | 
...
2024-02-06T07:23:56.300586Z DEBUG matrix_sdk_crypto::machine: Received a to-device event | crates/matrix-sdk-crypto/src/machine.rs:1117 | 
2024-02-06T07:23:56.301713Z DEBUG matrix_sdk_crypto::machine: Received a decrypted to-device event | crates/matrix-sdk-crypto/src/machine.rs:975 | 
2024-02-06T07:23:56.302390Z INFO matrix_sdk_crypto::machine: Received a new megolm room key | crates/matrix-sdk-crypto/src/machine.rs:755 | 

NSE:
2024-02-06T07:46:24.132801Z DEBUG matrix_sdk_crypto::machine: Received a to-device event | crates/matrix-sdk-crypto/src/machine.rs:1117 | 
2024-02-06T07:46:24.133377Z DEBUG matrix_sdk_crypto::olm::account: Creating a new Olm session from a pre-key message | crates/matrix-sdk-crypto/src/olm/account.rs:953 | 
2024-02-06T07:46:24.133655Z WARN matrix_sdk_crypto::olm::account: Failed to create a new Olm session from a pre-key message: InboundCreation(MissingOneTimeKey("curve25519:3ZuWFzabqiB1bYH1iE/dys5sCxSh8/8KKgo1iNZxsE8")) | crates/matrix-sdk-crypto/src/olm/account.rs:1220 | 
2024-02-06T07:46:24.135368Z INFO matrix_sdk_crypto::session_manager::sessions: Marking session to be unwedged sender_key="curve25519:KWr4ACcEDQ/+VVP7/gHSNGDv1hu9NA4GZZqVckzAgEg" | crates/matrix-sdk-crypto/src/session_manager/sessions.rs:116 |

The important thing here is that the session was merrily decrypting to-device events fine, until the NSE process took over. At that point, it failed to find the session, and then failed to decrypt the event (due to the sender chain index being >0 I guess) which then marked the Olm session as wedged.

I believe the problem lies in decrypt_olm_message because the NSE process fails to find the session in existing_sessions, implying a cache miss when it should have hit.

The text was updated successfully, but these errors were encountered:

bnjbvr · 2024-02-08T17:12:24Z

Thanks for opening an issue. Indeed it looks like the session cache in the NSE isn't being reloaded, for some reason, and addressing #2624 might help with this.

bnjbvr · 2024-02-08T17:45:58Z

@kegsay do you see this line in the Rust logs, out of curiosity?

error when extending lock lease:

kegsay · 2024-02-09T12:51:42Z

No I do not see this log line. I did trace the "crypto store mismatch" stuff to see if it was ever wrong, and based on my reading of it, it was always very sane and reasonable.

richvdh · 2024-02-22T12:04:12Z

Presumably this only affects Element X iOS, because that is the only platform that uses two processes?

kegsay · 2024-02-22T15:43:27Z

Yes. Web with multiple tabs would also be similarly affected.

kegsay · 2024-02-29T15:34:51Z

I don't understand why we can't just disable the in-memory cache here. This would mean we exclusively do database hits which would work (sqlite supports multiprocess just fine: https://www.sqlite.org/faq.html#q5 ) If there are performance concerns, can we just profile it and see if that's indeed the case?

bnjbvr · 2024-02-29T15:43:03Z

For what it's worth, it's not only about having atomic transactions and enabling the WAL in sqlite. A few reasons off the top of my head:

we need to have some support for doing concurrent writes in indexeddb, and indexeddb doesn't have an equivalent to WAL, as far as I know.
even if we had purely atomic transactions, now we'd need to have wide application-level transactions with rollback. For instance, you receive an encrypted message; decrypting it will move the associated ratchet forward in the Account, which you'll want to read from the DB and store back into the DB after it's moved forward. So you're running an UPDATE Account(...) IF { state is what you expected before, aka it hasn't changed in the database }. If this transaction fails (because the state isn't what you expected, aka another process touched it), you now need to roll back to the previous Account state (and all associated state), and replay all the actions that lead up to the sqlite update.
"just disabling the memory cache" doesn't really make sense: it's not a single cache, it's a set of fields in structs scattered across the code base. There was a project to try to reunify those into an actual single cache structure, but this project has been considered low priority by Element's management and thus dropped.
The in-memory caches have been added exactly because of performance reasons, so we'd revert back to a state where we have performance battles we've already fought and won. We might need more benchmarks to prove if a cache is actually required for performance or not.

kegsay · 2024-02-29T16:14:40Z

we need to have some support for doing concurrent writes in indexeddb, and indexeddb doesn't have an equivalent to WAL, as far as I know.

Both firefox and chrome appear to be able to handle locking cross-tabs at the very least when I did some testing to try to trip it up with read-modify-writes, which may be enough?

richvdh · 2024-03-01T14:39:35Z

we need to have some support for doing concurrent writes in indexeddb, and indexeddb doesn't have an equivalent to WAL, as far as I know.

Both firefox and chrome appear to be able to handle locking cross-tabs at the very least when I did some testing to try to trip it up with read-modify-writes, which may be enough?

As Kegan has said, indexeddb is reasonably resilient to multiple "processes" attempting to write to it at once, although its transaction semantics are annoying and interact poorly with asynchronous code.

However, I don't think we should concern ourselves too much with indexeddb here; multiprocess behaviour on web is not currently an issue due to element-hq/element-web#25157, and right now I think it's more likely that we'll solve multi-tab operation by having a single worker process which is shared between tabs at the JS level than by trying to make concurrent matrix-sdk-crypto processes truly robust. (Note that the NSE process is an easier problem to solve than the web process, because the NSE process does less stuff. For example, it will never send encrypted messages, which means a whole class of failure modes is ruled out.)

In other words: if indexeddb support were the only thing worrying us here, I wouldn't count that as a good argument.

bnjbvr · 2024-03-01T14:43:17Z

(Note that the NSE process is an easier problem to solve than the web process, because the NSE process does less stuff. For example, it will never send encrypted messages, which means a whole class of failure modes is ruled out.)

iOS Share Extension enters the room

(This is another process that will be used to send encrypted messages, for what it's worth.)

richvdh · 2024-03-01T14:44:42Z

ugh.

ara4n · 2024-03-02T15:20:32Z

also, i think NSE /can/ send encrypted messages (via quick reply functionality) - i.e. replying from lockscreen etc.

ara4n · 2024-03-03T12:50:55Z

(also, if rust-sdk /does/ have to support concurrent access to the same underlying store - e.g. for NSE, it feels really weird not to avail ourselves of that as a feature on EWR, imo, rather than coming up with an entirely different service-worker based architecture that solves the same underlying problem)

kegsay · 2024-03-04T11:12:53Z

NSE can send messages in EI today, though accidentally, which is what element-hq/element-ios#7751 is talking about - specifically key sharing requests.

BillCarsonFr · 2024-03-04T12:53:46Z

Context Links:
IOS/EIR: share and quick reply are disabled element-hq/element-ios#7618
IOS/EIR: Background machine matrix-org/matrix-ios-sdk#1704

richvdh · 2024-03-04T12:55:01Z

also, i think NSE /can/ send encrypted messages (via quick reply functionality) - i.e. replying from lockscreen etc.

For the record, I believe this is currently incorrect: the share extension and quick reply are currently disabled. Though it doesn't entirely change the point here: we need a "proper" fix to that eventually.

(also, if rust-sdk /does/ have to support concurrent access to the same underlying store - e.g. for NSE, it feels really weird not to avail ourselves of that as a feature on EWR, imo, rather than coming up with an entirely different service-worker based architecture that solves the same underlying problem)

hrmrmrmrm you may be right. Web locks make implementing the concurrency in the Rust less horrifying than it used to be.

That said: I don't think we've really figured out the semantics for how different processes can call /sync (and then OlmMachine.receive_sync_changes) and OlmMachine.outgoing_requests without fighting with each other. For example: a particular failure mode might be for both tabs to upload the same batch of OTKs: if one of the keys was claimed in the meantime, we could end up with UISIs. (Though, maybe we have the same problem with single tabs, since we do retries on /keys/upload requests.) We solve this in iOS by having the extensions not generate OTKs, which doesn't really work in web because all processes are equal; if you decide to elect a "master" then you're basically back to web workers.

Anyway: my real point here is that even if we massively improve the concurrency support in the crypto-sdk to the point where the iOS extension processes function correctly (which is a lot of work in itself), that doesn't necessarily get us to the full level of concurrency that web would need.

BillCarsonFr · 2024-03-04T13:10:20Z

Related NSE problem, causing re-use of megolm index (detected as replay attack) element-hq/element-ios#7499

kegsay · 2024-03-05T15:50:09Z

Anyway: my real point here is that even if we massively improve the concurrency support in the crypto-sdk to the point where the iOS extension processes function correctly (which is a lot of work in itself), that doesn't necessarily get us to the full level of concurrency that web would need.

I agree, we may need to be more creative to support multi-tab web (e.g Web Locks) but I don't think this should be a blocker for progress on this issue.

even if we had purely atomic transactions, now we'd need to have wide application-level transactions with rollback. For instance, you receive an encrypted message; decrypting it will move the associated ratchet forward in the Account, which you'll want to read from the DB and store back into the DB after it's moved forward. So you're running an UPDATE Account(...) IF { state is what you expected before, aka it hasn't changed in the database }. If this transaction fails (because the state isn't what you expected, aka another process touched it), you now need to roll back to the previous Account state (and all associated state), and replay all the actions that lead up to the sqlite update.

I think @bnjbvr is saying much the same thing here, that if we want true multi-process then we need additional locks in place.

richvdh · 2024-03-08T18:20:12Z

It seems like there are still a lot of unknowns here. For example, as well as the case Kegan mentioned in the OP, we've also seen cases where the NSE process appears to reload the OlmMachine from database, and still has a decryption failure. One theory is that outstanding tasks involving the OlmMachine may keep it alive after it's been destroyed, causing the "we have two OlmMachines" again, although exactly how that leads to the observed failure mode remains unclear.

Another theory is that maybe the current cross-process lock is not sound. We're not seeing much evidence of that in the logs though.

richvdh · 2024-04-09T15:24:33Z

#3313 is likely a big cause of this.

kegsay · 2024-04-18T07:38:25Z

@richvdh @andybalaam and myself have identified the root cause on this.

We re-use the SqliteCryptoStore when we regenetate_olm but the store contains a cache called session_cache which never gets updated/changed. I'm not sure how this has ever worked correctly. It seems in theory that the wedging will reliably happen if the following are true:

the nse process has started and loaded sessions
the app process does something to change the session (e.g sending a message after receiving one)
the nse process gets a 2nd message

This matches our minimal working example which does Bob receive -> Alice send -> Bob receive (this wedges the session, this assumes for each message there are new olm messages aka a rotation period of 1 msg). If we remove the cache the test passes.

We hypothesise it may also be the root cause of element-hq/element-ios#7751 and element-hq/element-ios#7480

…achine This fixes #3110

richvdh · 2024-04-18T09:27:58Z

@richvdh @andybalaam and myself have identified the root cause on this.

Correction: we have identified one root cause of this, which appears likely to cause a lot of the observed cases. If #3313 is real, it is likely another cause of these symptoms.

Adds a test for #3110 that fails before the fix and passes afterwards.

richvdh · 2024-05-13T20:52:30Z

I believe the fix for this is finally released today, in Element X iOS 1.6.7.

kegsay added bug Something isn't working encryption labels Feb 8, 2024

richvdh mentioned this issue Feb 22, 2024

Unable To Decrypt meta issue element-hq/element-meta#245

Open

64 tasks

richvdh mentioned this issue Feb 27, 2024

Olm sessions getting wedged element-hq/element-meta#2310

Open

poljar mentioned this issue Feb 28, 2024

Decryption failure caused one-time-key to be discarded element-hq/element-ios#7480

Closed

richvdh mentioned this issue Mar 8, 2024

My messages are received as UTDs element-hq/element-web#26973

Closed

richvdh mentioned this issue Apr 11, 2024

Ensure that existing megolm sessions are re-shared after a wedged olm session element-hq/element-meta#2389

Closed

kegsay added a commit that referenced this issue Apr 18, 2024

bugfix: ensure the SessionStore is cleared when regenerating the OlmM…

09955cf

…achine This fixes #3110

kegsay mentioned this issue Apr 18, 2024

bugfix: ensure the SessionStore is cleared when regenerating the OlmMachine #3338

Merged

andybalaam closed this as completed in #3338 Apr 19, 2024

gnunicorn mentioned this issue Apr 22, 2024

decryption error acterglobal/a3#1608

Closed

kegsay mentioned this issue Apr 25, 2024

Epic: One-Time Keys can desync between client and server causing UTDs element-hq/element-meta#2406

Open

richvdh mentioned this issue Apr 26, 2024

Failed to decrypt Olm message with InvalidMAC(MacError) element-hq/element-ios#7479

Closed

kegsay mentioned this issue Apr 29, 2024

UTDs: maubot/mautrix bridges fail to encrypt for EX sessions element-hq/element-meta#2387

Open

andybalaam added a commit that referenced this issue May 10, 2024

crypt: Test for the NSE race bug #3110

af35ac0

Adds a test for #3110 that fails before the fix and passes afterwards.

andybalaam mentioned this issue May 10, 2024

crypto: Add a standalone integration test for the NSE race #3343

Merged

andybalaam added a commit that referenced this issue May 10, 2024

crypt: Test for the NSE race bug #3110

a78d788

Adds a test for #3110 that fails before the fix and passes afterwards.

andybalaam added a commit that referenced this issue May 13, 2024

crypto: Test for the NSE race bug #3110

6c6a33c

Adds a test for #3110 that fails before the fix and passes afterwards.

andybalaam added a commit that referenced this issue May 13, 2024

crypto: Test for the NSE race bug #3110

06825c6

Adds a test for #3110 that fails before the fix and passes afterwards.

richvdh mentioned this issue May 14, 2024

Olm sessions broken by use of outdated cache element-hq/element-ios#7790

Closed

BillCarsonFr mentioned this issue May 24, 2024

Invalidate crypto store cache when entering foreground matrix-org/matrix-ios-sdk#1854

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Olm sessions can wedge when using multiple Clients (e.g with NSE process) #3110

Olm sessions can wedge when using multiple Clients (e.g with NSE process) #3110

kegsay commented Feb 8, 2024

bnjbvr commented Feb 8, 2024

bnjbvr commented Feb 8, 2024

kegsay commented Feb 9, 2024

richvdh commented Feb 22, 2024

kegsay commented Feb 22, 2024

kegsay commented Feb 29, 2024

bnjbvr commented Feb 29, 2024

kegsay commented Feb 29, 2024

richvdh commented Mar 1, 2024

bnjbvr commented Mar 1, 2024

richvdh commented Mar 1, 2024

ara4n commented Mar 2, 2024

ara4n commented Mar 3, 2024

kegsay commented Mar 4, 2024

BillCarsonFr commented Mar 4, 2024

richvdh commented Mar 4, 2024

BillCarsonFr commented Mar 4, 2024

kegsay commented Mar 5, 2024

richvdh commented Mar 8, 2024

richvdh commented Apr 9, 2024

kegsay commented Apr 18, 2024 •

edited

Loading

richvdh commented Apr 18, 2024 •

edited

Loading

richvdh commented May 13, 2024

Olm sessions can wedge when using multiple Clients (e.g with NSE process) #3110

Olm sessions can wedge when using multiple Clients (e.g with NSE process) #3110

Comments

kegsay commented Feb 8, 2024

bnjbvr commented Feb 8, 2024

bnjbvr commented Feb 8, 2024

kegsay commented Feb 9, 2024

richvdh commented Feb 22, 2024

kegsay commented Feb 22, 2024

kegsay commented Feb 29, 2024

bnjbvr commented Feb 29, 2024

kegsay commented Feb 29, 2024

richvdh commented Mar 1, 2024

bnjbvr commented Mar 1, 2024

richvdh commented Mar 1, 2024

ara4n commented Mar 2, 2024

ara4n commented Mar 3, 2024

kegsay commented Mar 4, 2024

BillCarsonFr commented Mar 4, 2024

richvdh commented Mar 4, 2024

BillCarsonFr commented Mar 4, 2024

kegsay commented Mar 5, 2024

richvdh commented Mar 8, 2024

richvdh commented Apr 9, 2024

kegsay commented Apr 18, 2024 • edited Loading

richvdh commented Apr 18, 2024 • edited Loading

richvdh commented May 13, 2024

kegsay commented Apr 18, 2024 •

edited

Loading

richvdh commented Apr 18, 2024 •

edited

Loading