
MBS-13449: Reimplement transactional cache without database locks #3130

Merged

mwiencek merged 15 commits into metabrainz:production from nix-advisory-locks on Feb 8, 2024

Conversation

mwiencek (Member)

Problem

The current implementation for MBS-7241 causes deadlocks and performance issues due to lock contention (MBS-11345, MBS-12314).

Solution

Implement a different approach that blocks cache additions by managing "cachelock" keys in our Redis store.

The new implementation is documented in Data::Role::EntityCache. Please review that to make sure it seems sound.

Testing

The existing test for MBS-7241 in GIDEntityCache passes. However, this hasn't been tested at scale. Complicating things, it also needs to be deployed to production and beta at the same time.

@mwiencek (Member Author)

BTW, there is at least one minor race condition here:

  • Request A reads artist ID=1 from the database. It's not in the cache, and there is no cachelock key.
  • Request B starts before request A adds artist ID=1 to the cache. It updates or deletes artist ID=1 and fully commits.
  • Request A finally adds (the old version of) artist ID=1 to the cache.

However, the window of time for this to happen is extremely narrow, so it's probably more acceptable than the current situation. I'll try to think of ways to mitigate this, too.
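
To make the ordering concrete, here's a minimal sketch of the unguarded check-read-cache flow that allows this race; the method and key names are illustrative, not the real EntityCache internals:

    # Request A: no cachelock key exists yet, so caching looks safe.
    unless ($store->exists('cachelock:artist:1')) {
        my $artist = $sql->select_single_row_hash(
            'SELECT * FROM artist WHERE id = ?', 1);
        # <-- Request B updates/deletes artist 1, sets its cachelock
        #     key, and fully commits within this window.
        $cache->set('artist:1', $artist);   # the stale row is cached
    }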

@reosarevok requested a review from yvanzo on January 9, 2024 09:35

@yvanzo (Contributor) left a comment


LGTMBDNT. The comments in the code are very explanatory. I don’t know how we could reproduce potential race conditions in CI tests, so it’s fine to skip that for now.

Is there anything special to be done when deploying on both instances? Like clearing some caches? Please document any such extra steps somewhere.

BTW, there is at least one minor race condition here:

  • Request A reads artist ID=1 from the database. It's not in the cache, and there is no cachelock key.
  • Request B starts before request A adds artist ID=1 to the cache. It updates or deletes artist ID=1 and fully commits.
  • Request A finally adds (the old version of) artist ID=1 to the cache.

However, the window of time for this to happen is extremely narrow, so it's probably more acceptable than the current situation. I'll try to think of ways to mitigate this, too.

Would it work to require request A to acquire the cachelock before loading the entity from the database?

@mwiencek (Member Author)

mwiencek commented Jan 9, 2024

@yvanzo Thanks for the review; actually I think I found a way to reproduce the race condition in the tests, and was working on a fix, but haven't pushed the changes yet.

Is there anything special to be done when deploying on both instances? Like clearing some caches? Please document any such extra steps somewhere.

Right, I will think about this more and add some documentation.

Would it work to require request A to acquire the cachelock before loading the entity from the database?

It would work if it were an exclusive lock, I think, though that's exactly what I'm trying to move away from to avoid lock contention: the "deleting" transaction doesn't actually read the cachelock key at all; it only sets it, in order to temporarily prevent any "reading" transaction from adding to the cache.

There is one thing that I think would help, and that is https://redis.io/docs/interact/transactions/#watch-explained which we could use to WATCH the cachelock key; so when "request A finally adds (the old version of) artist ID=1 to the cache," it would not apply, because it would see that the cachelock key was modified in the meantime.

The only problem with this is that the current implementation here stores the cachelock keys in our KeyDB store rather than the cache. Watching a key in the store will have no effect on the cache. So we'd have to move the cachelock keys to the cache, which seems iffy, since it means they can theoretically be evicted before we're done with them. However, perhaps this isn't the issue it seems, since we use the LRU eviction policy and it should be very unlikely that it's evicted in the time span we're talking about. It would be interesting if we had some stats regarding "average age of keys when they're evicted."
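
For reference, a rough sketch of the WATCH idea using the CPAN Redis client, assuming (per the above) the cachelock keys were moved into the same Redis database as the cache; the key names and payload are illustrative:

    use Redis;

    my $redis = Redis->new(server => '127.0.0.1:6379');
    my $serialized_artist = '...';   # placeholder payload

    # Abort the transaction below if this key is modified before EXEC.
    $redis->watch('cachelock:artist:1');

    $redis->multi;
    $redis->set('artist:1', $serialized_artist);
    my $applied = $redis->exec;   # nil/empty if the watched key changed,
                                  # i.e. the stale write was discarded
                                  # (exact return value is client-specific)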

@mwiencek changed the base branch from production to master on January 14, 2024 04:21
@mwiencek force-pushed the nix-advisory-locks branch 2 times, most recently from 8ebce73 to b2d4b82, on January 14, 2024 23:13
@mwiencek (Member Author)

mwiencek commented Jan 14, 2024

I renamed cachelock to recently_invalidated to make it clear that it's not an actual lock. (Let me know if you prefer the previous name, or have a better suggestion.)

I also made some changes which should help resolve the race condition described above:

  • Instead of checking for a recently_invalidated key before we add something to the cache, we check it immediately after we add something to the cache, and just delete the added entry right away if it was in fact recently invalidated (see the sketch after this list). (We could also check it before, but I feel there may be more of a downside to adding too many cache lookups to handle an already-rare occurrence.) So there is a very brief interval where a stale version of the entity may be returned, but it's limited to sub-milliseconds rather than whatever the ENTITY_CACHE_TTL is!
  • The recently_invalidated keys are no longer deleted after the deleting transaction ends; this was a mistake because there may still be stale inserting transactions open after the deleting one commits. Instead, we always leave them to expire after the max request time has elapsed.
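
A condensed sketch of the first point, with assumed method and key names (the real logic lives in Data::Role::EntityCache):

    # Add the entry first, then verify it wasn't invalidated meanwhile.
    $cache->set('artist:1', $artist);
    if ($store->exists('recently_invalidated:artist:1')) {
        # A concurrent transaction invalidated this entity while we
        # were inserting it; the entry we just wrote may be stale.
        $cache->delete('artist:1');
    }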

A test has been added for the scenario described in #3130 (comment), and another commit has been added to follow up on an issue I found for #2751 (comment).

@mwiencek changed the base branch from master to production on January 14, 2024 23:34
@mwiencek (Member Author)

Is there anything special to be done when deploying on both instances? Like clearing some caches? Please document any such extra steps somewhere.

I don't think anything special needs to be done, actually, since nothing currently in the cache would conflict with this PR; it only makes use of some newly added keys.

After deploying this to beta, there is a minor chance that the type of issue described by MBS-7241 can arise -- either a deleted entity remaining in the cache, or the cache returning an outdated version of an entity -- until it's also deployed to production.

@yvanzo (Contributor)

yvanzo commented Jan 15, 2024

Do you now intend to deploy it with the usual beta rather than to deploy it to both beta and production at the same time?
Should it be tracked with a separate ticket?

@mwiencek changed the title from "Reimplement MBS-7241 without database-level locks" to "MBS-13449: Reimplement transactional cache without database locks" on Jan 16, 2024
@mwiencek (Member Author)

mwiencek commented Jan 16, 2024

Do you now intend to deploy it with the usual beta rather than to deploy it to both beta and production at the same time?
Should it be tracked with a separate ticket?

Added MBS-13449. I think deploying it to production at the same time still makes sense, since if any cache issues arise, it might otherwise be unclear whether they're due to a bug in the code or to a lack of symmetry between the code on production and beta.

@mwiencek force-pushed the nix-advisory-locks branch 3 times, most recently from 52cf980 to b816524, on January 23, 2024 15:37
@mwiencek (Member Author)

https://ci.metabrainz.org/job/musicbrainz-server/10292/ passed (though it was accidentally based on master), and there is another test run queued that is based on production. (But I think the above run is sufficient to show that the tests pass now.)


     my $args2 = DBDefs->DATASTORE_REDIS_ARGS;
    -$args2->{database} = $args2->{test_database} + 1;
    +$args2->{database} = DBDefs->REDIS_TEST_DATABASE + 1;

A Member commented:


Given this here, should the docs in DBDefs say that both REDIS_TEST_DATABASE and REDIS_TEST_DATABASE + 1 will be completely erased on each test run? Or should we define REDIS_SECOND_TEST_DATABASE? You know someone will use 0 for test and 1 for production some day.

@mwiencek (Member Author) commented Jan 31, 2024


Yeah, I think we should just mention that any Redis database numbers greater than or equal to REDIS_TEST_DATABASE may be used for tests.

Edit: Updated the text in lib/DBDefs/Default.pm; let me know if that's better.

@mwiencek force-pushed the nix-advisory-locks branch 3 times, most recently from 27d6772 to 2e8b887, on January 31, 2024 15:17
@mwiencek (Member Author)

(Ignore the pushed diffs: I've accidentally rebased on master several times when this PR is based on production, and had to revert the changes.)

`Data::Role::EntityCache` expects all consumers' `merge` methods to have a
signature of `($self, $new_id, @old_ids)`, particularly in `after merge` where
`_delete_from_cache` is called. Artist was the only model whose signature did
not match, which caused the `EntityCache` code to attempt to delete keys like
`artist:ARRAY(0x1234567)` and whatever is in `%opts`.

I changed the existing method name to `merge_with_opts`, and added a new
`merge` implementation that matches other entities (and can only be called
without options).

I expected that this would fix a bug where the "old" IDs were not being deleted
for artists in `after merge`. However, this commit doesn't seem to fix any bug,
because the old IDs are deleted separately by `_delete_and_redirect_gids` in
`Data::Artist::merge` (now `merge_with_opts`); the only issue actually fixed
here is that we were attempting to delete bogus keys from Redis, which caused
no other visible problem.
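
A sketch of the resulting split; the exact argument handling in `merge_with_opts` is assumed:

    # Data::Artist (sketch): `merge` now matches the
    # ($self, $new_id, @old_ids) shape that Data::Role::EntityCache
    # expects in its `after merge` hook.
    sub merge {
        my ($self, $new_id, @old_ids) = @_;
        # Delegate with no options; callers that need options must call
        # merge_with_opts directly.
        return $self->merge_with_opts($new_id, \@old_ids);
    }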
There is another variable declared as `$data` below, which is only
distinguished from `%data` by the sigil.
`_create_cache_entries` takes a hash ref, so we can just pass the ref through
this way. Nothing in `_add_to_cache` requires a hash.
@mwiencek force-pushed the nix-advisory-locks branch 2 times, most recently from a5f376b to 0293336, on February 6, 2024 01:22

@yvanzo (Contributor) left a comment


Thanks for the added tests using nested transactions, because race conditions would be very difficult to reproduce otherwise!

I found minor oversights, mostly in the text, but still have one question:

Any reason not to mark the artist as recently invalidated in cache even before updating it in the database? That would further reduce the chances to have a cached stale artist.

Resolved review threads:

  • lib/MusicBrainz/Server/Data/Artist.pm
  • t/lib/t/Context.pm (outdated)
  • t/lib/t/MusicBrainz/Server/Data/Artist.pm (outdated)
  • lib/DBDefs/Default.pm (outdated)
  • t/lib/t/MusicBrainz/Server/Data/Role/GIDEntityCache.pm (outdated; 3 threads)
@mwiencek (Member Author)

mwiencek commented Feb 6, 2024

Any reason not to mark the artist as recently invalidated in cache even before updating it in the database? That would further reduce the chances to have a cached stale artist.

IIUC this is already the case: in _delete_from_cache, set_multi(@invalidated_flags) (paraphrasing) is already called before delete_multi(@ids); and _delete_from_cache itself is called before the database transaction which updates/deletes the entity commits.
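
In other words (paraphrased method names, as above):

    # Inside the updating/deleting database transaction:
    $store->set_multi(@invalidated_flags);   # mark IDs as recently
                                             # invalidated first...
    $cache->delete_multi(@ids);              # ...then drop the entries
    # The database transaction commits only after this has run.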

This moves the calculation of the cache prefix (`$self->_type .  ':'`) to a
static attribute.
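
One way to express that as a lazy Moose attribute (the attribute name is illustrative):

    has _cache_prefix => (
        is      => 'ro',
        lazy    => 1,
        default => sub { shift->_type . ':' },
    );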
I find "abc" to be more confusing than it needs to be when trying to debug
these tests, because I don't immediately recognize it as a GID lookup.

The list of IDs to cache may be truncated by `$MAX_CACHE_ENTRIES`, so calling
`keys %$data` will not be equivalent everywhere. This commit is purely
refactoring, but allows consistent access to the truncated IDs in a future
commit.
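
A sketch of the idea, assuming a `%$data` hash of id => entity:

    # Compute the (possibly truncated) ID list once and reuse it,
    # instead of re-deriving it via `keys %$data` in multiple places.
    my @ids = keys %$data;
    if (@ids > $MAX_CACHE_ENTRIES) {
        @ids = @ids[0 .. $MAX_CACHE_ENTRIES - 1];
    }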
If we build multiple contexts, I would expect them to use separate connectors
without having to build those manually. This is needed in an upcoming commit
which adds a `GIDEntityCache` test making use of multiple connectors.

(Refactoring only.) The code being replaced here is basically duplicating what
`cache_aware_c` already provides.

When using `cache_aware_c` in the tests, we currently don't clear the Redis
test database between test cases (similar to how we roll back `$c->sql`).

We also don't appear to be initializing the Redis store to use the
`test_database` number.

This commit adds a separate configuration option for the minimum Redis test
database number to DBDefs.pm, so that we can clear the correct database number
for both the cache and the store after each test case.
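
A sketch of the per-test cleanup this enables, using the CPAN Redis client (the server address and teardown placement are assumptions):

    use DBDefs;
    use Redis;

    my $redis = Redis->new(server => '127.0.0.1:6379');

    # After each test case, clear every Redis database number used by
    # the tests (one for the cache, one for the store).
    for my $db (DBDefs->REDIS_TEST_DATABASE, DBDefs->REDIS_TEST_DATABASE + 1) {
        $redis->select($db);
        $redis->flushdb;
    }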
This is intended to help resolve deadlocks (MBS-11345, MBS-12314) and
performance issues due to lock contention.

It supplants metabrainz#212.

The new implementation is documented in `Data::Role::EntityCache`.

If you call `get_by_gid` with a redirected gid, we'll cache the resolved gid of
the entity, but not the redirected one that was passed in. This fixes that.
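
A sketch of the fix (helper and method names are assumed): after resolving a redirected gid, store the entity under the requested gid as well as its canonical one:

    my $entity = $self->_get_by_gid_from_db($gid);   # assumed helper

    $cache->set($self->_type . ':' . $entity->gid, $entity);
    if ($entity->gid ne $gid) {
        # $gid was a redirect: cache it too, so the next lookup for
        # the old gid is also a hit.
        $cache->set($self->_type . ':' . $gid, $entity);
    }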
Initialize the Redis cache/store in the normal `c` instead of requiring access
to a separate `cache_aware_c` in the tests. Remove the mock caches, and just
use the Redis caches, which is a better test of real code; in fact it revealed
at least one bug where we don't invalidate releases from the cache when their
ASINs are updated (fixed in the prior commit).

I didn't see any point to the `EntityCache` and `GIDEntityCache` tests having
to check how many times certain mock methods were called. It had no bearing on
what ended up being stored in the cache, which was all that mattered.

@yvanzo (Contributor) left a comment


Reviewed replies, pushed changes and commit messages, all good to go! 🚢

@mwiencek merged commit 7ed2a0a into metabrainz:production on Feb 8, 2024
3 checks passed
@mwiencek deleted the nix-advisory-locks branch on February 8, 2024 18:59