Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Long running schema deltas are killed by systemd and lead to bootlooping #14401

Open
squahtx opened this issue Nov 9, 2022 · 13 comments
Open
Labels
A-Database DB stuff like queries, migrations, new/remove columns, indexes, unexpected entries in the db O-Occasional Affects or can be seen by some users regularly or most users rarely S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@squahtx
Copy link
Contributor

squahtx commented Nov 9, 2022

When a schema delta takes too long, Synapse will get restarted by systemd, possibly repeatedly.

~~When using postgres, schema deltas are transactional. Synapse will run the schema delta and time out again. ~~
When using sqlite, schema deltas are not fully transactional (see docstring here). The schema delta will end up being half applied and the database will end up in an inconsistent state. Depending on the nature of the schema delta, the next run may fail. Tracked in #14909.

This manifested as:

Things we may consider doing:

@squahtx squahtx added S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. A-Database DB stuff like queries, migrations, new/remove columns, indexes, unexpected entries in the db O-Occasional Affects or can be seen by some users regularly or most users rarely labels Nov 9, 2022
@DMRobertson
Copy link
Contributor

Is it possible to make the sqlite schema deltas transactional too? (Were they not before---did I break something in #13873?)

@squahtx
Copy link
Contributor Author

squahtx commented Nov 9, 2022

I forgot to mention, that understanding was based on the docstring for executescript(). I don't think #13873 broke it.

@staticmethod
def executescript(cursor: sqlite3.Cursor, script: str) -> None:
"""Execute a chunk of SQL containing multiple semicolon-delimited statements.
Python's built-in SQLite driver does not allow you to do this with DBAPI2's
`execute`:
> execute() will only execute a single SQL statement. If you try to execute more
> than one statement with it, it will raise a Warning. Use executescript() if
> you want to execute multiple SQL statements with one call.
Though the docs for `executescript` warn:
> If there is a pending transaction, an implicit COMMIT statement is executed
> first. No other implicit transaction control is performed; any transaction
> control must be added to sql_script.
"""
cursor.executescript(script)

@squahtx
Copy link
Contributor Author

squahtx commented Nov 9, 2022

Is it possible to make the sqlite schema deltas transactional too?

We could try injecting BEGIN and COMMIT statements into the script and see what happens.
According to https://stackoverflow.com/questions/4692690/is-it-possible-to-roll-back-create-table-and-alter-table-statements-in-major-sql, sqlite does support transactional DDL. It's only MySQL that doesn't.

@rltas
Copy link

rltas commented Nov 24, 2022

Probably a duplicate of #14100, also see @richvdh's comment in #13193, which I kind of agree with:

if you've got a big enough sqlite database to notice the delay, you're doing it wrong.

Use postgres or run update_synapse_database. Nothing wrong with preventing corruption in case this problem occurs, but I wouldn't try to support even larger sqlite instances through arbitrary timeouts (until someone hits even that limit).

@richvdh
Copy link
Member

richvdh commented Jan 21, 2023

When using sqlite, schema deltas are not fully transactional (see docstring here). The schema delta will end up being half applied and the database will end up in an inconsistent state. Depending on the nature of the schema delta, the next run may fail.

It looks like this may have been introduced by #13873, which replaced the old handrolled code for parsing sql files with something more conventional. That was probably a good idea, but it does break the transactional semantics we were relying on for our migrations...

@richvdh
Copy link
Member

richvdh commented Jan 21, 2023

I wonder if it would be better to split this into two issues:

  • slow updates cause a restart loop
  • on sqlite, you get the added bonus of an inconsistent db

The latter could probably be fixed by "just" making sure that the transaction gets committed after the update to applied_schema_deltas. The former is a bit tricker.

@mi-gri
Copy link

mi-gri commented Jan 21, 2023

Our server was affected by the problem.
After an upgrade (1.68 to 1.75) it only loops:

synapse.app._base - 207 - ERROR - main - Exception during startup
Traceback (most recent call last):
File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/app/homeserver.py", line 340, in setup
hs.setup()
File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/server.py", line 310, in setup
self.datastores = Databases(self.DATASTORE_CLASS, self)
File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/storage/databases/init.py", line 74, in init
prepare_database(
File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/storage/prepare_database.py", line 136, in prepare_database
_upgrade_existing_database(
File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/storage/prepare_database.py", line 520, in _upgrade_existing_database
database_engine.execute_script_file(cur, absolute_path)
File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/storage/engines/_base.py", line 145, in execute_script_file
cls.executescript(cursor, f.read())
File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/storage/engines/sqlite.py", line 144, in executescript
cursor.executescript(script)
File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/storage/database.py", line 395, in executescript
self._do_execute(self.txn.executescript, sql) # type: ignore[attr-defined]
File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/storage/database.py", line 436, in _do_execute
return func(sql, *args, **kwargs)
sqlite3.OperationalError: table threads already exists

With the help of @richvdh I was able to do the following and get the server running again:

  1. stop the server if it is still running (systemctl stop matrix-synapse)
  2. backup the database file (e.g. cp homeserver.db homeserver.db.bak)
  3. sqlite3 homeserver.db
  4. DROP TABLE threads;
  5. DELETE FROM background_updates WHERE update_name='threads_backfill';
  6. .exit
  7. update_synapse_database --database-config /etc/matrix-synapse/homeserver.yaml
    and if that finished without issues:
  8. restart the server (systemctl start matrix-synapse)

@richvdh
Copy link
Member

richvdh commented Jan 21, 2023

  • DROP TABLE threads;
  • DELETE FROM background_updates WHERE update_name='threads_backfill';

Be aware that this is not a general solution to this issue: it is only appropriate if the upgrade fails at a very specific point. If your server is hanging somewhere else in the upgrade cycle, these commands could cause data loss and additional problems.

@DMRobertson
Copy link
Contributor

When using sqlite, schema deltas are not fully transactional (see docstring here). The schema delta will end up being half applied and the database will end up in an inconsistent state. Depending on the nature of the schema delta, the next run may fail.

It looks like this may have been introduced by #13873, which replaced the old handrolled code for parsing sql files with something more conventional. That was probably a good idea, but it does break the transactional semantics we were relying on for our migrations...

Sorry, that's my mistake. FWIW the old parser still exists, see e.g.

def get_statements(f: Iterable[str]) -> Generator[str, None, None]:

The only other option would be to inject an explicit BEGIN TRANSACTION and COMMIT before and after the script. But that's not a like-for-like replacement for what had before. Maybe it's best to revert back.

(I recall that when I was making the full schema dumps that parser couldn't handle something postgres-specific. Trigger definitions maybe?)

@richvdh
Copy link
Member

richvdh commented Jan 24, 2023

The only other option would be to inject an explicit BEGIN TRANSACTION and COMMIT before and after the script. But that's not a like-for-like replacement for what had before

It's not, but I think it would have the right semantics, and would avoid us maintaining our own SQL parser, with all the pitfalls that brings.

@squahtx
Copy link
Contributor Author

squahtx commented Jan 24, 2023

The non-transactional execution on sqlite has been split out into #14909.

@squahtx squahtx changed the title Long running schema deltas are killed by systemd and lead to bootlooping or inconsistent databases Long running schema deltas are killed by systemd and lead to bootlooping Jan 24, 2023
@aceArt-GmbH
Copy link
Contributor

aceArt-GmbH commented Sep 26, 2023

Not sure if this is worth a separate issue. Synapse 1.92.3. Postgres 16 Fresh database
synapse logs
2023-09-26 10:43:45,982 - root - 348 - WARNING - main - ***** STARTING SERVER *****
2023-09-26 10:43:45,982 - root - 349 - WARNING - main - Server /opt/venvs/matrix-synapse/lib/python3.10/site-packages/synapse/app/homeserver.py version 1.92.3
2023-09-26 10:43:45,982 - root - 354 - INFO - main - Server hostname: my.server
2023-09-26 10:43:45,983 - root - 355 - INFO - main - Instance name: master
2023-09-26 10:43:45,983 - root - 356 - INFO - main - Twisted reactor: EPollReactor
2023-09-26 10:43:45,983 - synapse.app.homeserver - 350 - INFO - main - Setting up server
2023-09-26 10:43:45,983 - synapse.server - 338 - INFO - main - Setting up.
2023-09-26 10:43:46,000 - synapse.storage.databases - 66 - INFO - main - [database config 'master']: Checking database server
2023-09-26 10:43:46,001 - synapse.storage.databases - 69 - INFO - main - [database config 'master']: Preparing for databases ['main', 'state']
2023-09-26 10:43:46,001 - synapse.storage.prepare_database - 128 - INFO - main - ['main', 'state']: Checking existing schema version
2023-09-26 10:43:46,002 - synapse.storage.prepare_database - 132 - INFO - main - ['main', 'state']: Existing schema is 81 (+0 deltas)
2023-09-26 10:43:46,002 - synapse.storage.databases.main - 339 - INFO - main - Checking database for consistency with configuration...
2023-09-26 10:43:46,002 - synapse.storage.prepare_database - 426 - INFO - main - Applying schema deltas for v81
2023-09-26 10:43:46,003 - synapse.storage.prepare_database - 561 - INFO - main - Schema now up to date
2023-09-26 10:43:46,003 - synapse.storage.databases - 84 - INFO - main - [database config 'master']: Starting 'main' database
2023-09-26 10:43:46,012 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for pushers(id): 1
2023-09-26 10:43:46,012 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for deleted_pushers(stream_id): 1
2023-09-26 10:43:46,012 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for push_rules_stream(stream_id): 1
2023-09-26 10:43:46,014 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for e2e_cross_signing_keys(stream_id): 1
2023-09-26 10:43:46,014 - synapse.storage.databases.main.event_push_actions - 1152 - INFO - main - Searching for stream ordering 1 month ago
2023-09-26 10:43:46,014 - synapse.storage.databases.main.event_push_actions - 1156 - INFO - main - Found stream ordering 1 month ago: it's 0
2023-09-26 10:43:46,014 - synapse.storage.databases.main.event_push_actions - 1159 - INFO - main - Searching for stream ordering 1 day ago
2023-09-26 10:43:46,015 - synapse.storage.databases.main.event_push_actions - 1163 - INFO - main - Found stream ordering 1 day ago: it's 0
2023-09-26 10:43:46,015 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for push_rules(id): 1
2023-09-26 10:43:46,015 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for push_rules_enable(id): 1
2023-09-26 10:43:46,017 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for access_tokens(id): 1
2023-09-26 10:43:46,017 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for refresh_tokens(id): 1
2023-09-26 10:43:46,018 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for event_reports(id): 1
2023-09-26 10:43:46,018 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for device_lists_stream(stream_id): 1
2023-09-26 10:43:46,018 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for user_signature_stream(stream_id): 1
2023-09-26 10:43:46,018 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for device_lists_outbound_pokes(stream_id): 1
2023-09-26 10:43:46,019 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for device_lists_changes_in_room(stream_id): 1
2023-09-26 10:43:46,019 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for device_lists_remote_pending(stream_id): 1
2023-09-26 10:43:46,019 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for device_lists_changes_converted_stream_position(stream_id): 1
2023-09-26 10:43:46,020 - synapse.storage.databases - 101 - INFO - main - [database config 'master']: Starting 'state' database
2023-09-26 10:43:46,020 - synapse.storage.databases - 116 - INFO - main - [database config 'master']: prepared
2023-09-26 10:43:46,020 - synapse.server - 341 - INFO - main - Finished setting up.
2023-09-26 10:43:46,021 - twisted - 275 - INFO - main - Redirected stdout/stderr to logs
2023-09-26 10:43:46,021 - synapse.app.homeserver - 175 - INFO - sentinel - Running
2023-09-26 10:43:46,023 - synapse.app.homeserver - 29 - INFO - sentinel - Set file limit to: 524288
2023-09-26 10:43:46,063 - synapse.federation.federation_server - 1458 - INFO - sentinel - Registering federation query handler for 'profile'
2023-09-26 10:43:46,063 - synapse.push.pusher - 38 - INFO - sentinel - email enable notifs: False
2023-09-26 10:43:46,066 - synapse.server - 885 - INFO - sentinel - Connecting to redis (path='/var/run/redis/redis-server.sock') for external cache
2023-09-26 10:43:46,066 - synapse.replication.tcp.redis - 285 - INFO - sentinel - Connecting to redis server UNIXAddress('/var/run/redis/redis-server.sock')
2023-09-26 10:43:46,067 - synapse.federation.federation_server - 1438 - INFO - sentinel - Registering federation EDU handler for 'm.device_list_update'
2023-09-26 10:43:46,067 - sso_mapper - 19 - INFO - sentinel - MAPPING:__init__:{"SSO_me_url": "https://api.my.server/idm-dil/rest/me"}
2023-09-26 10:43:46,174 - synapse.http.client - 414 - INFO - sentinel - Received response to GET https://idp.myserver.de:8443/realms/national/protocol/openid-connect/certs: 200
2023-09-26 10:43:46,175 - shared_secret_authenticator - 55 - INFO - sentinel - Enabled login types: ['com.devture.shared_secret_auth', 'm.login.password']
2023-09-26 10:43:46,175 - synapse.app._base - 579 - INFO - sentinel - Loaded module <shared_secret_authenticator.SharedSecretAuthProvider object at 0x7f60e9493ac0>
2023-09-26 10:43:46,175 - synapse.util.caches.lrucache - 223 - INFO - sentinel - Expiring LRU caches after 1800 seconds
2023-09-26 10:43:46,176 - synapse.federation.federation_server - 1438 - INFO - sentinel - Registering federation EDU handler for 'm.presence'
2023-09-26 10:43:46,177 - synapse.federation.federation_server - 1458 - INFO - sentinel - Registering federation query handler for 'directory'
2023-09-26 10:43:46,177 - synapse.replication.tcp.redis - 285 - INFO - sentinel - Connecting to redis server UNIXAddress('/var/run/redis/redis-server.sock')
2023-09-26 10:43:46,178 - synapse.handlers.pagination - 181 - INFO - sentinel - Setting up purge job with config: RetentionPurgeJob(interval=86400000, shortest_max_lifetime=None, longest_max_lifetime=None)
2023-09-26 10:43:46,180 - synapse.federation.federation_server - 1438 - INFO - sentinel - Registering federation EDU handler for 'm.signing_key_update'
2023-09-26 10:43:46,180 - synapse.federation.federation_server - 1438 - INFO - sentinel - Registering federation EDU handler for 'org.matrix.signing_key_update'
2023-09-26 10:43:46,180 - synapse.federation.federation_server - 1458 - INFO - sentinel - Registering federation query handler for 'client_keys'
2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching <synapse.rest.health.HealthResource object at 0x7f60e92f8cd0> to path b'/health'
2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching <twisted.web.resource.EncodingResourceWrapper object at 0x7f60e932b1f0> to path b'/_matrix/client'
2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching <twisted.web.resource.Resource object at 0x7f60e932b250> to path b'/.well-known'
2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching <synapse.rest.admin.AdminRestResource object at 0x7f60e932b430> to path b'/_synapse/admin'
2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching <synapse.rest.synapse.client.pick_idp.PickIdpResource object at 0x7f60e934dd20> to path b'/_synapse/client/pick_idp'
2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching <twisted.web.resource.Resource object at 0x7f60e934dde0> to path b'/_synapse/client/pick_username'
2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching <synapse.rest.synapse.client.new_user_consent.NewUserConsentResource object at 0x7f60e934e0e0> to path b'/_synapse/client/new_user_consent'
2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching <synapse.rest.synapse.client.sso_register.SsoRegisterResource object at 0x7f60e934e110> to path b'/_synapse/client/sso_register'
2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching <synapse.rest.synapse.client.unsubscribe.UnsubscribeResource object at 0x7f60e934e380> to path b'/_synapse/client/unsubscribe'
2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching <synapse.rest.synapse.client.oidc.OIDCResource object at 0x7f60e92e2b60> to path b'/_synapse/client/oidc'
2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching FilePath('/opt/venvs/matrix-synapse/lib/python3.10/site-packages/synapse/static') to path b'/_matrix/static'
2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching <synapse.rest.media.media_repository_resource.MediaRepositoryResource object at 0x7f60e934e740> to path b'/_matrix/media/r0'
2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching <synapse.rest.media.media_repository_resource.MediaRepositoryResource object at 0x7f60e934e740> to path b'/_matrix/media/v3'
2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching <synapse.rest.media.media_repository_resource.MediaRepositoryResource object at 0x7f60e934e740> to path b'/_matrix/media/v1'
2023-09-26 10:43:46,184 - twisted - 275 - INFO - sentinel - SynapseSite starting on '/sock/8000.sock'
2023-09-26 10:43:46,184 - synapse.app._base - 442 - INFO - sentinel - Synapse now listening on Unix Socket at: /sock/8000.sock
2023-09-26 10:43:46,187 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching <synapse.rest.health.HealthResource object at 0x7f60e934f430> to path b'/health'
2023-09-26 10:43:46,187 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching <synapse.replication.http.ReplicationRestResource object at 0x7f60e934f550> to path b'/_synapse/replication'
2023-09-26 10:43:46,187 - twisted - 275 - INFO - sentinel - SynapseSite starting on '/sock/9000.sock'
2023-09-26 10:43:46,187 - synapse.app._base - 442 - INFO - sentinel - Synapse now listening on Unix Socket at: /sock/9000.sock
2023-09-26 10:43:46,237 - synapse.storage.background_updates - 400 - INFO - background_updates-0 - Starting background schema updates for database master
2023-09-26 10:43:46,238 - synapse.replication.tcp.redis - 119 - INFO - sentinel - Connected to redis
2023-09-26 10:43:46,239 - synapse.replication.tcp.redis - 131 - INFO - subscribe-replication-0 - Sending redis SUBSCRIBE for ['my.server/USER_IP', 'my.server']
2023-09-26 10:43:46,239 - synapse.push.pusherpool - 373 - INFO - start_pushers-0 - Started pushers
2023-09-26 10:43:46,239 - synapse.storage.background_updates - 418 - INFO - background_updates-0 - No more background updates to do. Unscheduling background update task.
2023-09-26 10:43:46,239 - synapse.replication.tcp.redis - 134 - INFO - subscribe-replication-0 - Successfully subscribed to redis stream, sending REPLICATE command
2023-09-26 10:43:46,240 - synapse.replication.tcp.redis - 139 - INFO - subscribe-replication-0 - REPLICATE successfully sent
2023-09-26 10:44:16,177 - synapse.util.caches.lrucache - 210 - INFO - LruCache._expire_old_entries-0 - Dropped 0 items from caches
2023-09-26 10:44:46,176 - synapse.util.caches.lrucache - 210 - INFO - LruCache._expire_old_entries-1 - Dropped 0 items from caches
2023-09-26 10:45:15,185 - twisted - 275 - INFO - sentinel - Received SIGTERM, shutting down.
Seems like systemd SIGTERMs synapse. Increasing service `TimeoutSec` just delays things. Postgres is not showing any running queries. Connections are idle
---

Resolved. Wrong systemd config where ExecStartPre never finished

@DMRobertson
Copy link
Contributor

Not sure if this is worth a separate issue.

@aceArt-GmbH The logs don't show that Synapse is getting killed during the schema migration:

2023-09-26 10:43:46,003 - synapse.storage.prepare_database - 561 - INFO - main - Schema now up to date
...
2023-09-26 10:45:15,185 - twisted - 275 - INFO - sentinel - Received SIGTERM, shutting down.

You should look at systemd's logs to see what is happening. If they show evidence of a bug in Synapse, please open a new issue.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Database DB stuff like queries, migrations, new/remove columns, indexes, unexpected entries in the db O-Occasional Affects or can be seen by some users regularly or most users rarely S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

No branches or pull requests

6 participants