memory-store: release locks earlier to avoid deadlocks #3668

kegsay · 2024-07-09T08:48:00Z

I've been debugging a cause of flaky complement-crypto tests for about a month now. I was pretty convinced it was deadlocking somewhere in the memory store `save_changes` code. With additional logging, it's now clear that there is an ABBA-style deadlock (each call holds one lock while waiting for the lock the other holds) when `save_changes` is called at the same time as `get_state_events`.
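To make the ABBA failure mode concrete, here is a minimal sketch under stated assumptions. The struct and field names (`Store`, `room_info`, `room_state`) are hypothetical stand-ins, not the SDK's actual memory store, and it uses `std::sync::Mutex` for simplicity. The same ordering hazard applies to `RwLock` whenever a writer is involved. Instead of actually hanging, each thread uses `try_lock` for its second lock, so the program reports the cycle rather than deadlocking.

```rust
use std::sync::{Arc, Barrier, Mutex, TryLockError};
use std::thread;

/// Hypothetical stand-ins for two maps inside an in-memory store.
/// The names are illustrative only, not the SDK's actual fields.
struct Store {
    room_info: Mutex<Vec<String>>,  // lock "A"
    room_state: Mutex<Vec<String>>, // lock "B"
}

/// Returns true if each thread found its second lock held by the other,
/// i.e. a blocking `lock()` at that point would have waited forever.
fn abba_would_deadlock() -> bool {
    let store = Arc::new(Store {
        room_info: Mutex::new(vec![]),
        room_state: Mutex::new(vec![]),
    });
    // Used to make sure both threads hold their first lock before either
    // tries the second, and keep holding it until both have tried.
    let barrier = Arc::new(Barrier::new(2));

    let (s1, b1) = (Arc::clone(&store), Arc::clone(&barrier));
    let writer = thread::spawn(move || {
        // Like `save_changes` in this sketch: take A, then B.
        let _a = s1.room_info.lock().unwrap();
        b1.wait(); // both first locks are now held
        let blocked = matches!(s1.room_state.try_lock(), Err(TryLockError::WouldBlock));
        b1.wait(); // keep holding A until the other thread has also tried
        blocked
    });

    let (s2, b2) = (Arc::clone(&store), Arc::clone(&barrier));
    let reader = thread::spawn(move || {
        // Like `get_state_events` in this sketch: take B, then A (reverse order).
        let _b = s2.room_state.lock().unwrap();
        b2.wait();
        let blocked = matches!(s2.room_info.try_lock(), Err(TryLockError::WouldBlock));
        b2.wait();
        blocked
    });

    writer.join().unwrap() && reader.join().unwrap()
}

fn main() {
    println!("each thread blocked on the other: {}", abba_would_deadlock());
}
```

With real blocking `lock()` calls in place of `try_lock`, both threads would wait on each other indefinitely, which matches the symptom of the wedged tests.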

I've also adjusted the code for `get_user_ids`, as it has a very similar pattern and also acquires locks in the reverse order to `save_changes`, so it is potentially vulnerable to this too.
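The fix named in the title, releasing locks earlier, can be sketched as follows. This is an illustration under assumptions, not the PR's diff: `MemoryStore`, `members`, and `membership_state` are hypothetical names, and the writer is assumed to take its locks in the order `members` then `membership_state`. The reader naturally wants the opposite order, so it copies what it needs out of the first map and drops that guard before taking the second lock. Because it never holds two locks at once, it cannot form a cycle with the writer.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical in-memory store; field names are illustrative, not the SDK's.
#[derive(Default)]
struct MemoryStore {
    members: RwLock<HashMap<String, Vec<String>>>,     // room id -> user ids
    membership_state: RwLock<HashMap<String, String>>, // user id -> membership
}

impl MemoryStore {
    /// Writer path: holds both write locks, always taken in the order
    /// `members` -> `membership_state`.
    fn save_changes(&self, room: &str, user: &str, membership: &str) {
        let mut members = self.members.write().unwrap();
        let mut state = self.membership_state.write().unwrap();
        members.entry(room.to_owned()).or_default().push(user.to_owned());
        state.insert(user.to_owned(), membership.to_owned());
    }

    /// Reader path: consults the maps in the reverse order to `save_changes`,
    /// but never holds both locks at the same time.
    fn get_user_ids(&self, room: &str, membership: &str) -> Vec<String> {
        // Step 1: copy the matching user ids out of the first map.
        let matching: Vec<String> = {
            let state = self.membership_state.read().unwrap();
            state
                .iter()
                .filter(|(_, m)| m.as_str() == membership)
                .map(|(u, _)| u.clone())
                .collect()
        }; // read guard on `membership_state` is dropped here

        // Step 2: only now take the second lock.
        let members = self.members.read().unwrap();
        let in_room = members.get(room);
        matching
            .into_iter()
            .filter(|u| in_room.map_or(false, |v| v.contains(u)))
            .collect()
    }
}

fn main() {
    let store = MemoryStore::default();
    store.save_changes("!room:example.org", "@alice:example.org", "join");
    store.save_changes("!room:example.org", "@bob:example.org", "leave");
    println!("{:?}", store.get_user_ids("!room:example.org", "join"));
}
```

An alternative with the same effect is to acquire every lock in one fixed order everywhere; releasing early has the added benefit of shortening how long a writer can be kept waiting.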


poljar

A regression test would be nice, but I guess we can cheat a bit and use complement for this.

Previously we would only use persistence when the test strictly needed it, e.g. to test what happens between restarts. It's preferable to always run with persistence because:
- it more accurately models real Element X clients, e.g. performance characteristics and runtime code.
- any bugs in the DB layer are then also bugs in the client. Using the memory store has proven [error prone](matrix-org/matrix-rust-sdk#3668), and fixing these bugs doesn't improve the stability of EX at all.

`ClientCreationOpts` will still have the `Persistence` flag, though, as it remains useful to know when a test _needs_ persistence vs. when it does not. In the future, we may clean up locally stored files during test runtime, and this flag would allow us to know whether it is safe to delete the files or not.

codecov · 2024-07-09T09:02:42Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.28%. Comparing base (02dddb4) to head (115bf6c).
Report is 1 commit behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3668   +/-   ##
=======================================
  Coverage   84.27%   84.28%           
=======================================
  Files         259      259           
  Lines       26662    26662           
=======================================
+ Hits        22469    22471    +2     
+ Misses       4193     4191    -2


kegsay requested a review from a team as a code owner July 9, 2024 08:48

kegsay requested review from poljar and removed request for a team July 9, 2024 08:48

poljar approved these changes Jul 9, 2024


kegsay mentioned this pull request Jul 9, 2024

rust: always use persistence in clients matrix-org/complement-crypto#115

Merged

poljar enabled auto-merge (rebase) July 9, 2024 09:01

poljar merged commit e5f9294 into main Jul 9, 2024
40 checks passed

poljar deleted the kegan/memory-store-deadlock branch July 9, 2024 09:02

This was referenced Jul 9, 2024

rust: MemoryStore wedging? matrix-org/complement-crypto#77

Closed

WIP: debug logging for https://github.com/matrix-org/complement-crypto/issues/110 #3666

Closed
