Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Remove old rows from the cache_invalidation_stream_by_instance table automatically. (This table is not used when Synapse is configured to use SQLite.) #15868

Merged
merged 12 commits into from
Aug 8, 2023

Conversation

reivilibre
Copy link
Contributor

I propose a mechanism to clean up old (> 7d old, to be very generous) rows from the cache_invalidation_stream_by_instance table.

On LPnet, this table is 700 MB (4e6 rows) and has never been cleaned before. On Morg, this table is 191 GB (8e8 rows) but seems to have been manually cleaned in 2020.

These rows are only needed for mere instants whilst other workers catch up with the stream, so there is no utility in keeping them around for longer than a few instants.
However, to allow for clock drift, debugging and just generally out of caution, I set the cut-off to 7 days.

Fixes: #3665

Base: develop

Original commit schedule, with full messages:

  1. Add a cache invalidation clean-up task

  2. Run the cache invalidation stream clean-up on the background worker

  3. Tune down

  4. call_later is in millis!

Signed-off-by: Olivier Wilkinson (reivilibre) <oliverw@matrix.org>
@reivilibre reivilibre force-pushed the rei/cleanup_cache_invalidation branch from d8e5721 to fe7c6a9 Compare July 3, 2023 14:53
@reivilibre reivilibre marked this pull request as ready for review July 3, 2023 16:25
@reivilibre reivilibre requested a review from a team as a code owner July 3, 2023 16:25
@MadLittleMods MadLittleMods added the A-Database DB stuff like queries, migrations, new/remove columns, indexes, unexpected entries in the db label Jul 11, 2023
Copy link
Contributor

@MadLittleMods MadLittleMods left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor things

synapse/storage/databases/main/cache.py Outdated Show resolved Hide resolved
synapse/storage/databases/main/cache.py Outdated Show resolved Hide resolved
synapse/storage/databases/main/cache.py Outdated Show resolved Hide resolved
synapse/storage/databases/main/cache.py Outdated Show resolved Hide resolved
Comment on lines +648 to +672
# Determine whether we are caught up or still catching up
txn.execute(
"""
SELECT invalidation_ts FROM cache_invalidation_stream_by_instance
WHERE stream_id > ?
ORDER BY stream_id ASC
LIMIT 1
""",
(cutoff_stream_id,),
)
row = txn.fetchone()
if row is None:
in_backlog = False
else:
# We are in backlog if the next row could have been deleted
# if we didn't have such a small batch size
in_backlog = row[0] <= delete_up_to_millisec

txn.execute(
"""
DELETE FROM cache_invalidation_stream_by_instance
WHERE ? <= stream_id AND stream_id <= ?
""",
(earliest_stream_id, cutoff_stream_id),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we could easily determine the outcome of the deletion.

There is this to see how many rows were deleted if we're interested in that boilerplate.

(not saying we have to use it as the current way is simple, just cumbersome).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered checking the count of deleted rows but these streams can technically have gaps so you don't actually know how many rows are to be deleted, only the upper bound.

DELETE ... RETURNING ... was an option too but it felt less obvious to me and I wanted it to be 'obviously correct', whereas I wouldn't have as much confidence writing something that way

synapse/storage/databases/main/cache.py Outdated Show resolved Hide resolved
synapse/storage/databases/main/cache.py Show resolved Hide resolved
@reivilibre reivilibre requested a review from a team July 21, 2023 11:57
Copy link
Contributor

@DMRobertson DMRobertson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM. Have you had a chance to test this on your homeserver?

Only two other thoughts spring to mind is:

  • is it worth adding logging for this, so you can see the cleanup is making forward progress? I guess not; you can always SELECT min(stream_id) FROM ... the table.
  • Is this a generic mechanism/are there other streams whose data is safe to clean out periodically? I suggest you ignore this question so we can land this improvement as-is.

Comment on lines 587 to 588
def _clean_up_cache_invalidation_wrapper(self) -> None:
async def _clean_up_cache_invalidation_background() -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be slightly clearer if we just had one function that was decorated with @wrap_as_background_process? (Is that even equivalent? 🤷)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only learnt about this wrapper very recently, I will give it a go :)

Comment on lines +662 to +678
# Determine whether we are caught up or still catching up
txn.execute(
"""
SELECT invalidation_ts FROM cache_invalidation_stream_by_instance
WHERE stream_id > ?
ORDER BY stream_id ASC
LIMIT 1
""",
(cutoff_stream_id,),
)
row = txn.fetchone()
if row is None:
in_backlog = False
else:
# We are in backlog if the next row could have been deleted
# if we didn't have such a small batch size
in_backlog = row[0] <= delete_up_to_millisec
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense, but I probably would have queried for something like

                SELECT stream_id FROM cache_invalidation_stream_by_instance
                WHERE stream_id > ? OR invalidation_ts > ?
                ORDER BY stream_id ASC
                LIMIT 1 

i.e. to see if the complement of rows covered by the previous query is nonempty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I genuinely do only want to look at the first row after our current marker: adding more conditions like you suggest is either going to cause more work for the DB as it'd have to scan all the remaining rows until finding one (I think?) or in the case of the OR condition, it's going to not be able to use any index for that part of the condition.

Have to be honest I don't quite follow your suggestion's query, at first I was thinking you wanted WHERE stream_id > ? AND invalidation_ts <= ? but altogether not sure. Ignoring that I don't fully grok what you're saying, I think OR invalidation_ts > ? is potentially problematic as we don't have an index on invalidation_ts — I can't see how the query planner would make a generally efficient plan for the query :).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first I thought you were deleting things based on both stream_id and invalidation_ts. If so, it seemed odd to not take both into account when working out if had deleted everything that you wanted to.

I see now that you're only deleting things based on stream_id, using the invalidation_ts only to select a sensible upper limit on stream_id.

Paranoid note: it is not necessarily the case that x.stream_id > y.stream_id implies x.invalidation_ts > y.invalidation_ts (consider e.g. one worker is slow to complete transactions or has a laggy clock). I wouldn't expect this to have any meaingful effect in practice because any drifts should be small.

@reivilibre
Copy link
Contributor Author

Have you had a chance to test this on your homeserver?

yup!

Copy link
Contributor

@DMRobertson DMRobertson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy iff CI is.

@reivilibre reivilibre force-pushed the rei/cleanup_cache_invalidation branch from 366154b to ca52419 Compare August 7, 2023 15:15
@reivilibre reivilibre merged commit f3dc6dc into develop Aug 8, 2023
37 checks passed
@reivilibre reivilibre deleted the rei/cleanup_cache_invalidation branch August 8, 2023 10:10
yingziwu added a commit to yingziwu/synapse that referenced this pull request Aug 21, 2023
No significant changes since 1.90.0rc1.

- Scope transaction IDs to devices (implement [MSC3970](matrix-org/matrix-spec-proposals#3970)). ([\matrix-org#15629](matrix-org#15629))
- Remove old rows from the `cache_invalidation_stream_by_instance` table automatically (this table is unused in SQLite). ([\matrix-org#15868](matrix-org#15868))

- Fix a long-standing bug where purging history and paginating simultaneously could lead to database corruption when using workers. ([\matrix-org#15791](matrix-org#15791))
- Fix a long-standing bug where profile endpoint returned a 404 when the user's display name was empty. ([\matrix-org#16012](matrix-org#16012))
- Fix a long-standing bug where the `synapse_port_db` failed to configure sequences for application services and partial stated rooms. ([\matrix-org#16043](matrix-org#16043))
- Fix long-standing bug with deletion in dehydrated devices v2. ([\matrix-org#16046](matrix-org#16046))

- Add `org.opencontainers.image.version` labels to Docker containers [published by Matrix.org](https://hub.docker.com/r/matrixdotorg/synapse). Contributed by Mo Balaa. ([\matrix-org#15972](matrix-org#15972), [\matrix-org#16009](matrix-org#16009))

- Add a internal documentation page describing the ["streams" used within Synapse](https://matrix-org.github.io/synapse/v1.90/development/synapse_architecture/streams.html). ([\matrix-org#16015](matrix-org#16015))
- Clarify comment on the keys/upload over replication enpoint. ([\matrix-org#16016](matrix-org#16016))
- Do not expose Admin API in caddy reverse proxy example. Contributed by @NilsIrl. ([\matrix-org#16027](matrix-org#16027))

- Remove support for legacy application service paths. ([\matrix-org#15964](matrix-org#15964))
- Move support for application service query parameter authorization behind a configuration option. ([\matrix-org#16017](matrix-org#16017))

- Update SQL queries to inline boolean parameters as supported in SQLite 3.27. ([\matrix-org#15525](matrix-org#15525))
- Allow for the configuration of the backoff algorithm for federation destinations. ([\matrix-org#15754](matrix-org#15754))
- Allow modules to check whether the current worker is configured to run background tasks. ([\matrix-org#15991](matrix-org#15991))
- Update support for [MSC3958](matrix-org/matrix-spec-proposals#3958) to match the latest revision of the MSC. ([\matrix-org#15992](matrix-org#15992))
- Allow modules to schedule delayed background calls. ([\matrix-org#15993](matrix-org#15993))
- Properly overwrite the `redacts` content-property for forwards-compatibility with room versions 1 through 10. ([\matrix-org#16013](matrix-org#16013))
- Fix building the nix development environment on MacOS systems. ([\matrix-org#16019](matrix-org#16019))
- Remove leading and trailing spaces when setting a display name. ([\matrix-org#16031](matrix-org#16031))
- Combine duplicated code. ([\matrix-org#16023](matrix-org#16023))
- Collect additional metrics from `ResponseCache` for eviction. ([\matrix-org#16028](matrix-org#16028))
- Fix endpoint improperly declaring support for MSC3814. ([\matrix-org#16068](matrix-org#16068))
- Drop backwards compat hack for event serialization. ([\matrix-org#16069](matrix-org#16069))

* Update PyYAML to 6.0.1. ([\matrix-org#16011](matrix-org#16011))
* Bump cryptography from 41.0.2 to 41.0.3. ([\matrix-org#16048](matrix-org#16048))
* Bump furo from 2023.5.20 to 2023.7.26. ([\matrix-org#16077](matrix-org#16077))
* Bump immutabledict from 2.2.4 to 3.0.0. ([\matrix-org#16034](matrix-org#16034))
* Update certifi to 2023.7.22 and pygments to 2.15.1. ([\matrix-org#16044](matrix-org#16044))
* Bump jsonschema from 4.18.3 to 4.19.0. ([\matrix-org#16081](matrix-org#16081))
* Bump phonenumbers from 8.13.14 to 8.13.18. ([\matrix-org#16076](matrix-org#16076))
* Bump regex from 1.9.1 to 1.9.3. ([\matrix-org#16073](matrix-org#16073))
* Bump serde from 1.0.171 to 1.0.175. ([\matrix-org#15982](matrix-org#15982))
* Bump serde from 1.0.175 to 1.0.179. ([\matrix-org#16033](matrix-org#16033))
* Bump serde from 1.0.179 to 1.0.183. ([\matrix-org#16074](matrix-org#16074))
* Bump serde_json from 1.0.103 to 1.0.104. ([\matrix-org#16032](matrix-org#16032))
* Bump service-identity from 21.1.0 to 23.1.0. ([\matrix-org#16038](matrix-org#16038))
* Bump types-commonmark from 0.9.2.3 to 0.9.2.4. ([\matrix-org#16037](matrix-org#16037))
* Bump types-jsonschema from 4.17.0.8 to 4.17.0.10. ([\matrix-org#16036](matrix-org#16036))
* Bump types-netaddr from 0.8.0.8 to 0.8.0.9. ([\matrix-org#16035](matrix-org#16035))
* Bump types-opentracing from 2.4.10.5 to 2.4.10.6. ([\matrix-org#16078](matrix-org#16078))
* Bump types-setuptools from 68.0.0.0 to 68.0.0.3. ([\matrix-org#16079](matrix-org#16079))
Fizzadar added a commit to beeper/synapse-legacy-fork that referenced this pull request Sep 18, 2023
No significant changes since 1.90.0rc1.

- Scope transaction IDs to devices (implement [MSC3970](matrix-org/matrix-spec-proposals#3970)). ([\matrix-org#15629](matrix-org#15629))
- Remove old rows from the `cache_invalidation_stream_by_instance` table automatically (this table is unused in SQLite). ([\matrix-org#15868](matrix-org#15868))

- Fix a long-standing bug where purging history and paginating simultaneously could lead to database corruption when using workers. ([\matrix-org#15791](matrix-org#15791))
- Fix a long-standing bug where profile endpoint returned a 404 when the user's display name was empty. ([\matrix-org#16012](matrix-org#16012))
- Fix a long-standing bug where the `synapse_port_db` failed to configure sequences for application services and partial stated rooms. ([\matrix-org#16043](matrix-org#16043))
- Fix long-standing bug with deletion in dehydrated devices v2. ([\matrix-org#16046](matrix-org#16046))

- Add `org.opencontainers.image.version` labels to Docker containers [published by Matrix.org](https://hub.docker.com/r/matrixdotorg/synapse). Contributed by Mo Balaa. ([\matrix-org#15972](matrix-org#15972), [\matrix-org#16009](matrix-org#16009))

- Add a internal documentation page describing the ["streams" used within Synapse](https://matrix-org.github.io/synapse/v1.90/development/synapse_architecture/streams.html). ([\matrix-org#16015](matrix-org#16015))
- Clarify comment on the keys/upload over replication enpoint. ([\matrix-org#16016](matrix-org#16016))
- Do not expose Admin API in caddy reverse proxy example. Contributed by @NilsIrl. ([\matrix-org#16027](matrix-org#16027))

- Remove support for legacy application service paths. ([\matrix-org#15964](matrix-org#15964))
- Move support for application service query parameter authorization behind a configuration option. ([\matrix-org#16017](matrix-org#16017))

- Update SQL queries to inline boolean parameters as supported in SQLite 3.27. ([\matrix-org#15525](matrix-org#15525))
- Allow for the configuration of the backoff algorithm for federation destinations. ([\matrix-org#15754](matrix-org#15754))
- Allow modules to check whether the current worker is configured to run background tasks. ([\matrix-org#15991](matrix-org#15991))
- Update support for [MSC3958](matrix-org/matrix-spec-proposals#3958) to match the latest revision of the MSC. ([\matrix-org#15992](matrix-org#15992))
- Allow modules to schedule delayed background calls. ([\matrix-org#15993](matrix-org#15993))
- Properly overwrite the `redacts` content-property for forwards-compatibility with room versions 1 through 10. ([\matrix-org#16013](matrix-org#16013))
- Fix building the nix development environment on MacOS systems. ([\matrix-org#16019](matrix-org#16019))
- Remove leading and trailing spaces when setting a display name. ([\matrix-org#16031](matrix-org#16031))
- Combine duplicated code. ([\matrix-org#16023](matrix-org#16023))
- Collect additional metrics from `ResponseCache` for eviction. ([\matrix-org#16028](matrix-org#16028))
- Fix endpoint improperly declaring support for MSC3814. ([\matrix-org#16068](matrix-org#16068))
- Drop backwards compat hack for event serialization. ([\matrix-org#16069](matrix-org#16069))

* Update PyYAML to 6.0.1. ([\matrix-org#16011](matrix-org#16011))
* Bump cryptography from 41.0.2 to 41.0.3. ([\matrix-org#16048](matrix-org#16048))
* Bump furo from 2023.5.20 to 2023.7.26. ([\matrix-org#16077](matrix-org#16077))
* Bump immutabledict from 2.2.4 to 3.0.0. ([\matrix-org#16034](matrix-org#16034))
* Update certifi to 2023.7.22 and pygments to 2.15.1. ([\matrix-org#16044](matrix-org#16044))
* Bump jsonschema from 4.18.3 to 4.19.0. ([\matrix-org#16081](matrix-org#16081))
* Bump phonenumbers from 8.13.14 to 8.13.18. ([\matrix-org#16076](matrix-org#16076))
* Bump regex from 1.9.1 to 1.9.3. ([\matrix-org#16073](matrix-org#16073))
* Bump serde from 1.0.171 to 1.0.175. ([\matrix-org#15982](matrix-org#15982))
* Bump serde from 1.0.175 to 1.0.179. ([\matrix-org#16033](matrix-org#16033))
* Bump serde from 1.0.179 to 1.0.183. ([\matrix-org#16074](matrix-org#16074))
* Bump serde_json from 1.0.103 to 1.0.104. ([\matrix-org#16032](matrix-org#16032))
* Bump service-identity from 21.1.0 to 23.1.0. ([\matrix-org#16038](matrix-org#16038))
* Bump types-commonmark from 0.9.2.3 to 0.9.2.4. ([\matrix-org#16037](matrix-org#16037))
* Bump types-jsonschema from 4.17.0.8 to 4.17.0.10. ([\matrix-org#16036](matrix-org#16036))
* Bump types-netaddr from 0.8.0.8 to 0.8.0.9. ([\matrix-org#16035](matrix-org#16035))
* Bump types-opentracing from 2.4.10.5 to 2.4.10.6. ([\matrix-org#16078](matrix-org#16078))
* Bump types-setuptools from 68.0.0.0 to 68.0.0.3. ([\matrix-org#16079](matrix-org#16079))

# -----BEGIN PGP SIGNATURE-----
#
# iQIzBAABCAAdFiEE8SRSDO7gYkSP4chELS76LzL74EcFAmTbUOEACgkQLS76LzL7
# 4Efskw/+J4X30PoqSvBWbilr8kTzwNSDXrkefYXR2sLburgowCyuAtKtCdbvnZUX
# 3KRwii5/GDsduXiNY836oRxO/KWE43b1ce9C9qJM7V6NmgkJBgHRvnh69wdlmBqt
# 6b6TQHoEByYS7yK90+QsRm1Bqrw7eoVO9oxcZ+4lb7Mjswf491Pga8kFJqdvjtTX
# UXo4vAqYyP6Yn7sUrQmXy0N8gZ5ZFHhZEvZZ8+iEsNjPO468cSVGq8/iPB1EwBm2
# nbfZWMDnD2p7plJezXOPEBxnVR3RrWbCbK08SiiNMcQynCvBgAUfkd3GnsO726jb
# 19i8p6tjuj2r41UgqYCTG2i2ij6uJquA/qq3rIiVNQVKG9aPHQ8hJfu9XOdEvaJe
# Je9H6QFrU/hR640tFvb5Hdc/4ThabvtC5xgl4ZGT6y6I0s5LNwk8fJiz3sDFt0i1
# XKsqGVemBGopZnwjQIPFJaHjPT7of33PXLE/hf1vX+oXU/6MNbFYkDLY9nnnQeOx
# 0GEbiYaxrj8SfxNmEykMLNCfxwJ719cSR1q8vPYn6r6TOS1pJMV0SgciXoaQ/VW6
# WlRpZolvXYSye34JW8Rg4ojAz0oYfJ2IiUpwY7eSEq4DtuosTjEECKrgB8DLqKlM
# +qDd4/yeqVN5/sui5oGsR71aTMy/jdnzqmdmuFvsSwz9/7PfMEU=
# =FWp5
# -----END PGP SIGNATURE-----
# gpg: Signature made Tue Aug 15 11:18:09 2023 BST
# gpg:                using RSA key F124520CEEE062448FE1C8442D2EFA2F32FBE047
# gpg: Can't check signature: No public key

# Conflicts:
#	.github/workflows/docker.yml
#	CHANGES.md
#	Cargo.lock
#	debian/changelog
#	docs/upgrade.md
#	flake.lock
#	poetry.lock
#	pyproject.toml
#	rust/src/push/base_rules.rs
#	synapse/events/utils.py
#	synapse/handlers/pagination.py
#	synapse/http/site.py
#	synapse/rest/client/devices.py
#	synapse/storage/databases/main/roommember.py
#	synapse/storage/schema/__init__.py
#	tests/rest/client/test_devices.py
#	tests/rest/client/test_redactions.py
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Database DB stuff like queries, migrations, new/remove columns, indexes, unexpected entries in the db
Projects
None yet
Development

Successfully merging this pull request may close these issues.

we never clear out cache_invalidation_stream
3 participants