Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure device list updates are robust to race conditions and network failures #432

Merged
merged 3 commits into from
May 10, 2024

Conversation

kegsay
Copy link
Member

@kegsay kegsay commented May 9, 2024

Fixes #430

This does two things:

  • Fix a hypothetical race condition which could cause a lost update at insert time (due to lack of .. FOR UPDATE)
  • Fix an issue which will cause lost device list updates if the response containing the update is A) lost and B) the client starts a new connection for whatever reason e.g timed out, full buffer, server restart.

This adds a regression test and refactors the unit tests to be testing the correct semantics.

History on this is:

  • the original impl worked but was confusing
  • dmr refactored it in 740c14c but did introduce a bug. The result previously had the New items put into Sent (the original intent) but now we were copying result to writeBack and then swapping_ New to Sent. As the caller only looks in Sent, it won't see any device list updates. This would fix itself on the next request because we then return Sent confusingly. However, the semantics are like this to guard against network failures, and since this refactor specific network failures would drop the update.
  • tests got written using the buggy code as a guide 1fe920d

I found this out when trying to fix this test in complement crypto, which is sensitive to how quickly the device list update will be delivered to the EX client. Hence me finding the bug and fixing it.

This is effectively a UTD cause for the purposes of element-hq/element-meta#245 because it meant EX clients may miss a device list update, causing EX clients to fail to encrypt for newly logged in devices (just like that failing complement crypto test).

kegsay added 2 commits May 8, 2024 14:26
As we are selecting... for updates. Without this, we can drop
updates to the floor incorrectly.

See #430
Otherwise we can cause lost device list updates if the client resets
the connection (as the device list update won't be in any in-memory cache).
@kegsay kegsay added the bug Something isn't working label May 9, 2024
They were A) confusing and B) testing the wrong thing as they got refactored
after the bad refactor of the source code..
@kegsay kegsay requested a review from devonh May 9, 2024 14:33
Copy link

@devonh devonh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything checks out to me. The new test logic is clear and easy to follow.

@kegsay kegsay merged commit 0d22cf1 into main May 10, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

device lists don't get resent on rapid changes
2 participants