memory-store: release locks earlier to avoid deadlocks #3668

Merged 1 commit into main on Jul 9, 2024

kegsay
Member

@kegsay kegsay commented Jul 9, 2024

I've been debugging a cause of flaky complement-crypto tests for about a month now. I was pretty convinced it was deadlocking somewhere in the memory store `save_changes` code. With additional logging, it's now clear that there is an ABBA-style deadlock when `save_changes` is called at the same time as `get_state_events`.

I've also adjusted the code for `get_user_ids`, as it follows a very similar pattern and also acquires locks in the reverse order to `save_changes`, so it is potentially vulnerable to the same deadlock.
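
For readers unfamiliar with the pattern, here is a minimal sketch of the "release locks earlier" idea. It is not the actual memory store code: the `Store` type, its `members` and `profiles` fields, and the `rooms_for_name` and `run_concurrently` functions are hypothetical names made up for illustration. The writer takes lock A and then lock B and holds both; the reader needs B first and then A, so it copies what it needs out of B and drops that guard before touching A, which means it never holds one lock while waiting for the other.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical stand-in for an in-memory store guarded by two locks.
#[derive(Default)]
struct Store {
    /// Lock A: room id -> user ids.
    members: RwLock<HashMap<String, Vec<String>>>,
    /// Lock B: user id -> display name.
    profiles: RwLock<HashMap<String, String>>,
}

impl Store {
    /// Writer: acquires A, then B, and holds both while applying the change.
    fn save_changes(&self, room: &str, user: &str, name: &str) {
        let mut members = self.members.write().unwrap();
        let mut profiles = self.profiles.write().unwrap();
        members.entry(room.to_owned()).or_default().push(user.to_owned());
        profiles.insert(user.to_owned(), name.to_owned());
    }

    /// Reader: needs B, then A (the reverse of `save_changes`).
    /// The B guard is dropped before A is requested, so this function
    /// never waits for one lock while still holding the other.
    fn rooms_for_name(&self, name: &str) -> Vec<String> {
        let profiles = self.profiles.read().unwrap();
        let users: Vec<String> = profiles
            .iter()
            .filter(|(_, n)| n.as_str() == name)
            .map(|(u, _)| u.clone())
            .collect();
        drop(profiles); // release B early

        let members = self.members.read().unwrap();
        let mut rooms: Vec<String> = members
            .iter()
            .filter(|(_, ids)| ids.iter().any(|id| users.contains(id)))
            .map(|(room, _)| room.clone())
            .collect();
        rooms.sort();
        rooms
    }
}

/// Runs the writer and the reader on separate threads, then reports the
/// rooms that contain a user whose display name is "Alice".
fn run_concurrently() -> Vec<String> {
    let store = Arc::new(Store::default());
    let writer = {
        let store = Arc::clone(&store);
        thread::spawn(move || {
            for i in 0..1_000 {
                store.save_changes(&format!("!room{}", i % 3), "@alice:example.org", "Alice");
            }
        })
    };
    let reader = {
        let store = Arc::clone(&store);
        thread::spawn(move || {
            for _ in 0..1_000 {
                let _ = store.rooms_for_name("Alice");
            }
        })
    };
    writer.join().unwrap();
    reader.join().unwrap();
    store.rooms_for_name("Alice")
}
```

Calling `run_concurrently()` from a `main` function or a test exercises both paths at once. The trade-off of this approach is that the reader sees a snapshot of B that may be slightly stale by the time it reads A, which is the price of not holding both locks together.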

@kegsay kegsay requested a review from a team as a code owner July 9, 2024 08:48
@kegsay kegsay requested review from poljar and removed request for a team July 9, 2024 08:48
Contributor

@poljar poljar left a comment

A regression test would be nice, but I guess we can cheat a bit and use complement for this.

kegsay added a commit to matrix-org/complement-crypto that referenced this pull request Jul 9, 2024
Previously we would only use persistence when the test strictly needed
it, e.g. to test what happens between restarts. It's preferable to always
run with persistence because:
 - it more accurately models real Element X clients, e.g. performance
   characteristics, runtime code.
 - any bugs in the DB layer are then also bugs in the client. Using
   the memory store has proven [error prone](matrix-org/matrix-rust-sdk#3668)
   and fixing these bugs doesn't improve the stability of EX at all.

`ClientCreationOpts` will still have the `Persistence` flag though,
as it remains useful to know when a test _needs_ persistence vs when
it does not. In the future, we may clean up locally stored files
during test runtime, and this flag would allow us to know whether it
is safe to delete the files or not.
@poljar poljar enabled auto-merge (rebase) July 9, 2024 09:01
@poljar poljar merged commit e5f9294 into main Jul 9, 2024
40 checks passed
codecov bot commented Jul 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.28%. Comparing base (02dddb4) to head (115bf6c).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3668   +/-   ##
=======================================
  Coverage   84.27%   84.28%           
=======================================
  Files         259      259           
  Lines       26662    26662           
=======================================
+ Hits        22469    22471    +2     
+ Misses       4193     4191    -2     
