
Could not serialize access due to concurrent DELETE from presence_stream #15827

Closed
realtyem opened this issue Jun 25, 2023 · 0 comments · Fixed by #15826
Labels
A-Database: DB stuff like queries, migrations, new/remove columns, indexes, unexpected entries in the db
A-Presence
O-Uncommon: Most users are unlikely to come across this or unexpected workflow
S-Tolerable: Minor significance, cosmetic issues, low or no impact to users.
T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@realtyem
Contributor

2023-06-24 20:17:27.496 CDT [29164] ERROR:  could not serialize access due to concurrent delete
2023-06-24 20:17:27.496 CDT [29164] STATEMENT:  DELETE FROM presence_stream WHERE stream_id < 31609735 AND user_id = ANY(ARRAY[<a list of 34 different users>])
2023-06-25 00:19:10.120 CDT [29528] ERROR:  could not serialize access due to concurrent delete
2023-06-25 00:19:10.120 CDT [29528] STATEMENT:  DELETE FROM presence_stream WHERE stream_id < 31624045 AND user_id = ANY(ARRAY['@user:someserver'])
2023-06-25 00:21:01.669 CDT [29511] ERROR:  could not serialize access due to concurrent delete
2023-06-25 00:21:01.669 CDT [29511] STATEMENT:  DELETE FROM presence_stream WHERE stream_id < 31624211 AND user_id = ANY(ARRAY['@otheruser:someotherserver'])
2023-06-25 00:21:01.669 CDT [29526] ERROR:  could not serialize access due to concurrent delete
2023-06-25 00:21:01.669 CDT [29526] STATEMENT:  DELETE FROM presence_stream WHERE stream_id < 31624212 AND user_id = ANY(ARRAY['@otheruser:someotherserver'])
2023-06-25 00:21:01.679 CDT [29526] ERROR:  could not serialize access due to concurrent delete
2023-06-25 00:21:01.679 CDT [29526] STATEMENT:  DELETE FROM presence_stream WHERE stream_id < 31624212 AND user_id = ANY(ARRAY['@otheruser:someotherserver'])
2023-06-25 00:21:02.085 CDT [29527] ERROR:  could not serialize access due to concurrent delete
2023-06-25 00:21:02.085 CDT [29527] STATEMENT:  DELETE FROM presence_stream WHERE stream_id < 31624214 AND user_id = ANY(ARRAY['@thirduser:yetanotherserver'])
2023-06-25 00:21:02.117 CDT [29498] ERROR:  could not serialize access due to concurrent delete
2023-06-25 00:21:02.117 CDT [29498] STATEMENT:  DELETE FROM presence_stream WHERE stream_id < 31624217 AND user_id = ANY(ARRAY['@thirduser:yetanotherserver'])
2023-06-25 00:21:02.117 CDT [29524] ERROR:  could not serialize access due to concurrent delete
2023-06-25 00:21:02.117 CDT [29524] STATEMENT:  DELETE FROM presence_stream WHERE stream_id < 31624218 AND user_id = ANY(ARRAY['@thirduser:yetanotherserver'])
2023-06-25 00:21:02.128 CDT [29524] ERROR:  could not serialize access due to concurrent delete
2023-06-25 00:21:02.128 CDT [29524] STATEMENT:  DELETE FROM presence_stream WHERE stream_id < 31624218 AND user_id = ANY(ARRAY['@thirduser:yetanotherserver'])

So this is annoying. While it doesn't occur all that regularly, it does seem to come in batches. I took a little while to investigate this and found:

  1. The function this occurs in, PresenceStore.update_presence(), takes between 35-47ms to complete when this happens, but only 5-25ms when it doesn't (on my two-user workerized homeserver). This appears to be due to the rollback of the transaction before retrying. (These numbers should not be considered standardized, as several factors can influence them; e.g. monolith Postgres would sometimes take 11-95ms without any contention. The point is that a rolled-back transaction can nearly double the time it takes to complete, or worse if it has to retry multiple times.)
  2. Inside the presence_handler, there are 3 places this PresenceStore function is called (see the scheduling sketch after this list):
    1. _on_shutdown() - fairly self-evident what this does.
    2. _persist_unpersisted_changes() - used like a drag net to make sure anything that still needs to be written to the database gets written, as otherwise it will stack up and affect shutdown times. Run as a looping call every 60 seconds.
    3. handle_presence_timeouts (which is actually _handle_timeouts()) -> _update_states() -> _persist_and_notify() - the main point of entry for changes in presence state, responsible for sending changes over federation and persisting them. Run as a looping call every 5 seconds.
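
For context, here is a minimal sketch (not the actual handler code; it assumes Twisted's LoopingCall, which Synapse's clock wraps) of how those last two persistence paths are scheduled. Because 60 is a multiple of 5, their ticks line up periodically, so both paths can call PresenceStore.update_presence() for overlapping users at roughly the same time:

```python
# Sketch only: two independent looping calls as described above.
# The function bodies are stand-ins, not the real handler code.
from twisted.internet import reactor, task

def _persist_unpersisted_changes():
    # Drag net: write any presence changes that are still unpersisted.
    ...

def _handle_timeouts():
    # Main path: _update_states() -> _persist_and_notify().
    ...

# Both loops start together, so their ticks coincide every 60 seconds,
# and both can end up issuing a DELETE against presence_stream.
task.LoopingCall(_persist_unpersisted_changes).start(60.0)
task.LoopingCall(_handle_timeouts).start(5.0)
reactor.run()
```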

The main reason this seems to occur is that _persist_unpersisted_changes() and _persist_and_notify() sometimes run at the same time, due to overlapping timeouts. The process of persisting presence changes includes running a DELETE before the INSERT, in order to clean up rows from previous stream IDs. When _persist_unpersisted_changes() runs at the same time as _persist_and_notify(), it operates on a lower stream ID, so both transactions try to delete overlapping rows and Postgres rejects the later one with the serialization error above. The solution seems to be to handle this in a way that allows the transactions to serialize.
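
To make the failure mode concrete, here is a minimal reproduction sketch (not Synapse code; the psycopg2 usage, connection details and user ID are assumptions on my part). Two REPEATABLE READ transactions take their snapshots, then both try to DELETE overlapping presence_stream rows; once the first commits, the second hits SQLSTATE 40001 and has to roll back and retry, which matches the extra latency described in point 1 above (the conflict only fires if both ranges actually match existing rows for that user):

```python
# Reproduction sketch for "could not serialize access due to concurrent delete".
import psycopg2
from psycopg2 import errors

# Connection details are placeholders.
conn_a = psycopg2.connect("dbname=synapse")
conn_b = psycopg2.connect("dbname=synapse")
for conn in (conn_a, conn_b):
    conn.set_session(isolation_level="REPEATABLE READ")

with conn_a.cursor() as cur_a, conn_b.cursor() as cur_b:
    # Both transactions take their snapshots before either DELETE commits.
    cur_a.execute("SELECT 1")
    cur_b.execute("SELECT 1")

    # Transaction A (like _persist_and_notify()) deletes up to a newer
    # stream_id and commits first.
    cur_a.execute(
        "DELETE FROM presence_stream WHERE stream_id < %s AND user_id = ANY(%s)",
        (31624218, ["@user:example.org"]),
    )
    conn_a.commit()

    # Transaction B (like _persist_unpersisted_changes()) holds an older
    # snapshot and tries to delete rows A has already removed.
    try:
        cur_b.execute(
            "DELETE FROM presence_stream WHERE stream_id < %s AND user_id = ANY(%s)",
            (31624045, ["@user:example.org"]),
        )
        conn_b.commit()
    except errors.SerializationFailure:
        # SQLSTATE 40001: the caller must roll back and re-run the whole
        # transaction, which is the retry cost measured above.
        conn_b.rollback()
```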

@MadLittleMods added the A-Presence, S-Tolerable, T-Defect, A-Database, and O-Uncommon labels on Jun 28, 2023