
MBS-13449: Reimplement transactional cache without database locks #3130

Merged

mwiencek merged 15 commits into metabrainz:production from nix-advisory-locks on Feb 8, 2024

Conversation

mwiencek (Member)

Problem

The current implementation for MBS-7241 causes deadlocks and performance issues due to lock contention (MBS-11345, MBS-12314).

Solution

Implement a different approach that blocks cache additions by managing "cachelock" keys in our Redis store.

The new implementation is documented in Data::Role::EntityCache. Please review that to make sure it seems sound.

Testing

The existing test for MBS-7241 in GIDEntityCache passes. However, this hasn't been tested at scale. Complicating things, it also needs to be deployed to production and beta at the same time.

@mwiencek (Member Author)

BTW, there is at least one minor race condition here:

  • Request A reads artist ID=1 from the database. It's not in the cache, and there is no cachelock key.
  • Request B starts before request A adds artist ID=1 to the cache. It updates or deletes artist ID=1 and fully commits.
  • Request A finally adds (the old version of) artist ID=1 to the cache.

However, the window of time for this to happen is extremely narrow, so it's probably more acceptable than the current situation. I'll try to think of ways to mitigate this, too.
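
To make the ordering concrete, here's a minimal sketch of the unguarded check-read-cache flow that allows this race; the method and key names are illustrative, not the real EntityCache internals:

    # Request A: no cachelock key exists yet, so caching looks safe.
    unless ($store->exists('cachelock:artist:1')) {
        my $artist = $sql->select_single_row_hash(
            'SELECT * FROM artist WHERE id = ?', 1);
        # <-- Request B updates/deletes artist 1, sets its cachelock
        #     key, and fully commits within this window.
        $cache->set('artist:1', $artist);   # the stale row is cached
    }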

@reosarevok requested a review from yvanzo on January 9, 2024 09:35

@yvanzo (Contributor) left a comment


LGTMBDNT. The comments in the code are very explanatory. I don’t know how we could reproduce potential race conditions in CI tests, so it’s fine to skip that for now.

Is there anything special to be done when deploying on both instances? Like clearing some caches? Please document any such extra steps somewhere.

BTW, there is at least one minor race condition here:

  • Request A reads artist ID=1 from the database. It's not in the cache, and there is no cachelock key.
  • Request B starts before request A adds artist ID=1 to the cache. It updates or deletes artist ID=1 and fully commits.
  • Request A finally adds (the old version of) artist ID=1 to the cache.

However, the window of time for this to happen is extremely narrow, so it's probably more acceptable than the current situation. I'll try to think of ways to mitigate this, too.

Would it work to require request A to acquire the cachelock before loading the entity from the database?

@mwiencek (Member Author)

mwiencek commented Jan 9, 2024

@yvanzo Thanks for the review; actually I think I found a way to reproduce the race condition in the tests, and was working on a fix, but haven't pushed the changes yet.

Is there anything special to be done when deploying on both instances? Like clearing some caches? Please document any such extra steps somewhere.

Right, I will think about this more and add some documentation.

Would it work to require request A to acquire the cachelock before loading the entity from the database?

It would work if it were an exclusive lock, I think, though that's exactly what I'm trying to move away from to avoid lock contention: the "deleting" transaction doesn't actually read the cachelock key at all; it only sets it, in order to temporarily prevent any "reading" transaction from adding to the cache.

There is one thing that I think would help, and that is https://redis.io/docs/interact/transactions/#watch-explained which we could use to WATCH the cachelock key; so when "request A finally adds (the old version of) artist ID=1 to the cache," it would not apply, because it would see that the cachelock key was modified in the meantime.

The only problem with this is that the current implementation here stores the cachelock keys in our KeyDB store rather than the cache. Watching a key in the store will have no effect on the cache. So we'd have to move the cachelock keys to the cache, which seems iffy, since it means they can theoretically be evicted before we're done with them. However, perhaps this isn't the issue it seems, since we use the LRU eviction policy and it should be very unlikely that it's evicted in the time span we're talking about. It would be interesting if we had some stats regarding "average age of keys when they're evicted."
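
For reference, a rough sketch of the WATCH idea using the CPAN Redis client, assuming (per the above) the cachelock keys were moved into the same Redis database as the cache; the key names and payload are illustrative:

    use Redis;

    my $redis = Redis->new(server => '127.0.0.1:6379');
    my $serialized_artist = '...';   # placeholder payload

    # Abort the transaction below if this key is modified before EXEC.
    $redis->watch('cachelock:artist:1');

    $redis->multi;
    $redis->set('artist:1', $serialized_artist);
    my $applied = $redis->exec;   # nil/empty if the watched key changed,
                                  # i.e. the stale write was discarded
                                  # (exact return value is client-specific)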

@mwiencek changed the base branch from production to master on January 14, 2024 04:21
@mwiencek force-pushed the nix-advisory-locks branch 2 times, most recently from 8ebce73 to b2d4b82, on January 14, 2024 23:13
@mwiencek (Member Author)

mwiencek commented Jan 14, 2024

I renamed cachelock to recently_invalidated to make it clear that it's not an actual lock. (Let me know if you prefer the previous name, or have a better suggestion.)

I also made some changes which should help resolve the race condition described above:

  • Instead of checking for a recently_invalidated key before we add something to the cache, we check it immediately after we add something to the cache, and just delete the added entry right away if it was in fact recently invalidated (see the sketch after this list). (We could also check it before, but I feel there may be more of a downside to adding too many cache lookups to handle an already-rare occurrence.) So there is a very brief interval where a stale version of the entity may be returned, but it's limited to sub-milliseconds rather than whatever the ENTITY_CACHE_TTL is!
  • The recently_invalidated keys are no longer deleted after the deleting transaction ends; this was a mistake because there may still be stale inserting transactions open after the deleting one commits. Instead, we always leave them to expire after the max request time has elapsed.
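
A condensed sketch of the first point, with assumed method and key names (the real logic lives in Data::Role::EntityCache):

    # Add the entry first, then verify it wasn't invalidated meanwhile.
    $cache->set('artist:1', $artist);
    if ($store->exists('recently_invalidated:artist:1')) {
        # A concurrent transaction invalidated this entity while we
        # were inserting it; the entry we just wrote may be stale.
        $cache->delete('artist:1');
    }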

A test has been added for the scenario described in #3130 (comment), and another commit has been added to follow up on an issue I found for #2751 (comment).

@mwiencek changed the base branch from master to production on January 14, 2024 23:34
@mwiencek (Member Author)

Is there anything special to be done when deploying on both instances? Like clearing some caches? Please document any such extra steps somewhere.

I don't think anything special needs to be done, actually, since nothing currently in the cache would conflict with this PR; it only makes use of some newly added keys.

After deploying this to beta, there is a minor chance that the type of issue described by MBS-7241 can arise -- either a deleted entity remaining in the cache, or the cache returning an outdated version of an entity -- until it's also deployed to production.

@yvanzo (Contributor)

yvanzo commented Jan 15, 2024

Do you now intend to deploy it with the usual beta rather than to deploy it to both beta and production at the same time?
Should it be tracked with a separate ticket?

@mwiencek changed the title from "Reimplement MBS-7241 without database-level locks" to "MBS-13449: Reimplement transactional cache without database locks" on Jan 16, 2024
@mwiencek (Member Author)

mwiencek commented Jan 16, 2024

Do you now intend to deploy it with the usual beta rather than to deploy it to both beta and production at the same time?
Should it be tracked with a separate ticket?

Added MBS-13449. I think deploying it to production at the same time still makes sense, since if any cache issues arise, it might otherwise be unclear whether they're due to a bug in the code or to a lack of symmetry between the code on production and beta.

@mwiencek force-pushed the nix-advisory-locks branch 3 times, most recently from 52cf980 to b816524, on January 23, 2024 15:37
@mwiencek (Member Author)

https://ci.metabrainz.org/job/musicbrainz-server/10292/ passed (though it was accidentally based on master), and there is another test run queued that is based on production. (But I think the above run is sufficient to show that the tests pass now.)


     my $args2 = DBDefs->DATASTORE_REDIS_ARGS;
    -$args2->{database} = $args2->{test_database} + 1;
    +$args2->{database} = DBDefs->REDIS_TEST_DATABASE + 1;

A Member commented:


Given this here, should the docs in DBDefs say that both REDIS_TEST_DATABASE and REDIS_TEST_DATABASE + 1 will be completely erased on each test run? Or should we define REDIS_SECOND_TEST_DATABASE? You know someone will use 0 for test and 1 for production some day.

@mwiencek (Member Author) commented Jan 31, 2024


Yeah, I think we should just mention that any Redis database numbers greater than or equal to REDIS_TEST_DATABASE may be used for tests.

Edit: Updated the text in lib/DBDefs/Default.pm; let me know if that's better.

@mwiencek force-pushed the nix-advisory-locks branch 3 times, most recently from 27d6772 to 2e8b887, on January 31, 2024 15:17
@mwiencek (Member Author)

(Ignore the pushed diffs: I've accidentally rebased on master several times when this PR is based on production, and had to revert the changes.)

`Data::Role::EntityCache` expects all consumers' `merge` methods to have a
signature of `($self, $new_id, @old_ids)`, particularly in `after merge` where
`_delete_from_cache` is called. Artist was the only model whose signature did
not match, which caused the `EntityCache` code to attempt to delete keys like
`artist:ARRAY(0x1234567)` and whatever is in `%opts`.

I changed the existing method name to `merge_with_opts`, and added a new
`merge` implementation that matches other entities (and can only be called
without options).

I expected that this would fix a bug where the "old" IDs were not being deleted
for artists in `after merge`. However, this commit doesn't seem to fix any bug,
because the old IDs are deleted separately by `_delete_and_redirect_gids` in
`Data::Artist::merge` (now `merge_with_opts`); the only issue actually fixed
here is that we were attempting to delete bogus keys from Redis, which caused
no other visible problem.
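
A sketch of the resulting split; the exact argument handling in `merge_with_opts` is assumed:

    # Data::Artist (sketch): `merge` now matches the
    # ($self, $new_id, @old_ids) shape that Data::Role::EntityCache
    # expects in its `after merge` hook.
    sub merge {
        my ($self, $new_id, @old_ids) = @_;
        # Delegate with no options; callers that need options must call
        # merge_with_opts directly.
        return $self->merge_with_opts($new_id, \@old_ids);
    }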
There is another variable declared as `$data` below, which is only
distinguished from `%data` by the sigil.
`_create_cache_entries` takes a hash ref, so we can just pass the ref through
this way. Nothing in `_add_to_cache` requires a hash.
@mwiencek force-pushed the nix-advisory-locks branch 2 times, most recently from a5f376b to 0293336, on February 6, 2024 01:22

@yvanzo (Contributor) left a comment


Thanks for the added tests using nested transactions, because race conditions would be very difficult to reproduce otherwise!

I found minor oversights, mostly in the text, but still have one question:

Any reason not to mark the artist as recently invalidated in cache even before updating it in the database? That would further reduce the chances to have a cached stale artist.

Resolved review threads:

  • lib/MusicBrainz/Server/Data/Artist.pm
  • t/lib/t/Context.pm (outdated)
  • t/lib/t/MusicBrainz/Server/Data/Artist.pm (outdated)
  • lib/DBDefs/Default.pm (outdated)
  • t/lib/t/MusicBrainz/Server/Data/Role/GIDEntityCache.pm (outdated; 3 threads)
@mwiencek (Member Author)

mwiencek commented Feb 6, 2024

Any reason not to mark the artist as recently invalidated in cache even before updating it in the database? That would further reduce the chances to have a cached stale artist.

IIUC this is already the case: in _delete_from_cache, set_multi(@invalidated_flags) (paraphrasing) is already called before delete_multi(@ids); and _delete_from_cache itself is called before the database transaction which updates/deletes the entity commits.
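
In other words (paraphrased method names, as above):

    # Inside the updating/deleting database transaction:
    $store->set_multi(@invalidated_flags);   # mark IDs as recently
                                             # invalidated first...
    $cache->delete_multi(@ids);              # ...then drop the entries
    # The database transaction commits only after this has run.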

This moves the calculation of the cache prefix (`$self->_type .  ':'`) to a
static attribute.
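
One way to express that as a lazy Moose attribute (the attribute name is illustrative):

    has _cache_prefix => (
        is      => 'ro',
        lazy    => 1,
        default => sub { shift->_type . ':' },
    );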
I find "abc" to be more confusing than it needs to be when trying to debug
these tests, because I don't immediately recognize it as a GID lookup.

The list of IDs to cache may be truncated by `$MAX_CACHE_ENTRIES`, so calling
`keys %$data` will not be equivalent everywhere. This commit is purely
refactoring, but allows consistent access to the truncated IDs in a future
commit.
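
A sketch of the idea, assuming a `%$data` hash of id => entity:

    # Compute the (possibly truncated) ID list once and reuse it,
    # instead of re-deriving it via `keys %$data` in multiple places.
    my @ids = keys %$data;
    if (@ids > $MAX_CACHE_ENTRIES) {
        @ids = @ids[0 .. $MAX_CACHE_ENTRIES - 1];
    }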
If we build multiple contexts, I would expect them to use separate connectors
without having to build those manually. This is needed in an upcoming commit
which adds a `GIDEntityCache` test making use of multiple connectors.

(Refactoring only.) The code being replaced here is basically duplicating what
`cache_aware_c` already provides.

When using `cache_aware_c` in the tests, we currently don't clear the Redis
test database between test cases (similar to how we roll back `$c->sql`).

We also don't appear to be initializing the Redis store to use the
`test_database` number.

This commit adds a separate configuration option for the minimum Redis test
database number to DBDefs.pm, so that we can clear the correct database number
for both the cache and the store after each test case.
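
A sketch of the per-test cleanup this enables, using the CPAN Redis client (the server address and teardown placement are assumptions):

    use DBDefs;
    use Redis;

    my $redis = Redis->new(server => '127.0.0.1:6379');

    # After each test case, clear every Redis database number used by
    # the tests (one for the cache, one for the store).
    for my $db (DBDefs->REDIS_TEST_DATABASE, DBDefs->REDIS_TEST_DATABASE + 1) {
        $redis->select($db);
        $redis->flushdb;
    }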
This is intended to help resolve deadlocks (MBS-11345, MBS-12314) and
performance issues due to lock contention.

It supplants metabrainz#212.

The new implementation is documented in `Data::Role::EntityCache`.

If you call `get_by_gid` with a redirected gid, we'll cache the resolved gid of
the entity, but not the redirected one that was passed in. This fixes that.
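
A sketch of the fix (helper and method names are assumed): after resolving a redirected gid, store the entity under the requested gid as well as its canonical one:

    my $entity = $self->_get_by_gid_from_db($gid);   # assumed helper

    $cache->set($self->_type . ':' . $entity->gid, $entity);
    if ($entity->gid ne $gid) {
        # $gid was a redirect: cache it too, so the next lookup for
        # the old gid is also a hit.
        $cache->set($self->_type . ':' . $gid, $entity);
    }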
Initialize the Redis cache/store in the normal `c` instead of requiring access
to a separate `cache_aware_c` in the tests. Remove the mock caches, and just
use the Redis caches, which is a better test of real code; in fact it revealed
at least one bug where we don't invalidate releases from the cache when their
ASINs are updated (fixed in the prior commit).

I didn't see any point to the `EntityCache` and `GIDEntityCache` tests having
to check how many times certain mock methods were called. It had no bearing on
what ended up being stored in the cache, which was all that mattered.

@yvanzo (Contributor) left a comment


Reviewed replies, pushed changes and commit messages, all good to go! 🚢

@mwiencek merged commit 7ed2a0a into metabrainz:production on Feb 8, 2024
3 checks passed
@mwiencek deleted the nix-advisory-locks branch on February 8, 2024 18:59