MBS-13449: Reimplement transactional cache without database locks #3130
Conversation
(Force-pushed from fbdd10f to f61e349.)
LGTMBDNT. The code comments are very explanatory. I don't know how we could reproduce potential race conditions in CI tests, so it's fine to skip that for now.

Is there anything special to be done when deploying on both instances, like clearing some caches? Please document any such extra steps somewhere.
BTW, there is at least one minor race condition here:

* Request A reads artist ID=1 from the database. It's not in the cache, and there is no cachelock key.
* Request B starts before request A adds artist ID=1 to the cache. It updates or deletes artist ID=1 and fully commits.
* Request A finally adds (the old version of) artist ID=1 to the cache.

However, the window of time for this to happen is extremely narrow, so it is probably more acceptable than the current situation. I'll try to think of ways to mitigate this, too.
Would it work to require request A to acquire the cachelock before loading the entity from the database?
@yvanzo Thanks for the review; actually I think I found a way to reproduce the race condition in the tests, and was working on a fix, but haven't pushed the changes yet.
Right, I will think about this more and add some documentation.
It would work if it were an exclusive lock, I think, though that's what I'm trying to move away from to avoid lock contention: the "deleting" transaction doesn't actually read the cachelock key at all; it only sets it in order to temporarily prevent any "reading" transaction from adding to the cache.

There is one thing that I think would help: https://redis.io/docs/interact/transactions/#watch-explained, which we could use to abort the cache addition if the cachelock key is modified between the read and the write (see the sketch below).

The only problem with this is that the current implementation here stores the cachelock keys in our KeyDB store rather than the cache. Watching a key in the store will have no effect on the cache. So we'd have to move the cachelock keys to the cache, which seems iffy, since it means they could theoretically be evicted before we're done with them. However, perhaps this isn't the issue it seems, since we use the LRU eviction policy and it should be very unlikely that a key is evicted in the time span we're talking about. It would be interesting if we had some stats regarding "average age of keys when they're evicted."
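For illustration, here is a minimal sketch of that WATCH-based check-and-set idea against a bare Redis connection. The key names, helper, and skip-on-abort behavior are assumptions, not the PR's actual code, and it presumes the cachelock keys live in the same Redis database as the cache:

```perl
use strict;
use warnings;
use Redis;  # CPAN client; the real code goes through MusicBrainz's wrappers

my $redis = Redis->new(server => '127.0.0.1:6379');

# Try to add an entity to the cache, but give up if its cachelock key
# exists or is touched concurrently.
sub add_to_cache_unless_locked {
    my ($entity_key, $value) = @_;
    my $lock_key = "cachelock:$entity_key";

    # If any other client modifies the watched key between WATCH and
    # EXEC, Redis discards the transaction and EXEC returns a null
    # reply instead of the queued commands' results.
    $redis->watch($lock_key);

    if ($redis->exists($lock_key)) {
        # A concurrent update/delete is in flight; skip caching.
        $redis->unwatch;
        return 0;
    }

    $redis->multi;
    $redis->set($entity_key, $value);
    my @replies = $redis->exec;

    # No defined replies means the transaction was aborted; skipping
    # the cache addition is always safe.
    return (grep { defined } @replies) ? 1 : 0;
}
```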
(Force-pushed from f61e349 to 34393b1.)
(Force-pushed from 8ebce73 to b2d4b82.)
I renamed […]. I also made some changes which should help resolve the race condition described above.

A test has been added for the scenario described in #3130 (comment), and another commit has been added to follow up on an issue I found for #2751 (comment).
(Force-pushed from b2d4b82 to 98b0c85.)
I don't think anything special needs to be done, actually, since nothing currently in the cache would conflict with this PR; it only makes use of some newly added keys. After deploying this to beta, there is a minor chance that the type of issue described by MBS-7241 can arise (either a deleted entity remaining in the cache, or the cache returning an outdated version of an entity) until it's also deployed to production.
Do you now intend to deploy it with the usual […]?
(Force-pushed from 98b0c85 to 4c43d61.)
(Force-pushed from 4c43d61 to f981f2d.)
Added MBS-13449. I think deploying it to production still makes sense, since if any cache issues arise, it might otherwise be unclear whether they're due to a bug in the code or to a lack of symmetry between the code on production and beta.
(Force-pushed from 52cf980 to b816524.)
https://ci.metabrainz.org/job/musicbrainz-server/10292/ passed (though it was accidentally based on master), and there is another test run queued that's based on production. (But I think the above run is sufficient to show that the tests work now.)
```diff
  my $args2 = DBDefs->DATASTORE_REDIS_ARGS;
- $args2->{database} = $args2->{test_database} + 1;
+ $args2->{database} = DBDefs->REDIS_TEST_DATABASE + 1;
```
Given this here, should the docs in DBDefs say that both `REDIS_TEST_DATABASE` and `REDIS_TEST_DATABASE + 1` will be completely erased on each test run? Or should we define `REDIS_SECOND_TEST_DATABASE`? You know someone will use 0 for test and 1 for production some day.
Yeah, I think we should just mention that any Redis database numbers greater than or equal to `REDIS_TEST_DATABASE` may be used for tests.

Edit: Updated the text in lib/DBDefs/Default.pm; let me know if that's better.
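For reference, a hypothetical sketch of what such a note in lib/DBDefs/Default.pm might look like; the actual wording and default value in the repository may differ:

```perl
# Any Redis database number greater than or equal to this one may be
# flushed by the test suite, so never point production data at those
# database numbers.
sub REDIS_TEST_DATABASE { 1 }   # default value assumed here
```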
(Force-pushed from 27d6772 to 2e8b887.)
(Ignore the pushed diffs: I've accidentally rebased on master several times, even though this PR is based on production, and had to revert the changes.)
(Force-pushed from 2e8b887 to 455bab7.)
`Data::Role::EntityCache` expects all consumers' `merge` methods to have a signature of `($self, $new_id, @old_ids)`, particularly in `after merge` where `_delete_from_cache` is called. Artist was the only model whose signature did not match, which caused the `EntityCache` code to attempt to delete keys like `artist:ARRAY(0x1234567)` and whatever is in `%opts`. I changed the existing method name to `merge_with_opts`, and added a new `merge` implementation that matches other entities (and can only be called without options).

I expected that this would fix a bug where the "old" IDs were not being deleted for artists in `after merge`. However, this commit doesn't seem to fix any bug, because the old IDs are deleted separately by `_delete_and_redirect_gids` in `Data::Artist::merge` (now `merge_with_opts`); the only issue being fixed here is that we were attempting to delete bogus keys from Redis, which did not cause any other visible issue.
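To illustrate why the mismatched signature produced those keys, here is a small self-contained sketch; the package name and the `rename_credits` option are stand-ins, not the actual code:

```perl
use strict;
use warnings;

package EntityCacheDemo;

# Mimics EntityCache's `after merge`, which assumes the positional
# arguments are ($new_id, @old_ids):
sub after_merge {
    my ($self, $new_id, @old_ids) = @_;
    print "would delete: artist:$_\n" for @old_ids;
}

package main;

# Artist's old merge signature was ($new_id, \@old_ids, %opts), so the
# array ref and the flattened options hash all landed in @old_ids:
EntityCacheDemo->after_merge(3, [1, 2], rename_credits => 1);
# Output (addresses vary):
#   would delete: artist:ARRAY(0x55d1...)
#   would delete: artist:rename_credits
#   would delete: artist:1
```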
There is another variable declared as `$data` below, which is only distinguished from `%data` by the sigil.
`_create_cache_entries` takes a hash ref, so we can just pass the ref through this way. Nothing in `_add_to_cache` requires a hash.
(Force-pushed from a5f376b to 0293336.)
Thanks for the added tests using nested transactions; race conditions would be very difficult to reproduce otherwise!

I found minor oversights, mostly in text, but still have one question: is there any reason not to mark the artist as recently invalidated in the cache even before updating it in the database? That would further reduce the chances of caching a stale artist.
IIUC this is already the case: in […]
This moves the calculation of the cache prefix (`$self->_type . ':'`) to a static attribute.
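A hedged sketch of what that might look like in Moose terms; the class and attribute names here are assumed, not taken from the commit:

```perl
package ArtistDataDemo;   # stand-in for a Data:: model consuming the role
use Moose;

sub _type { 'artist' }

# Build the "artist:" prefix once, lazily, instead of concatenating it
# on every cache access.
has _cache_prefix => (
    is      => 'ro',
    lazy    => 1,
    default => sub { shift->_type . ':' },
);

package main;
print ArtistDataDemo->new->_cache_prefix, "\n";   # artist:
```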
I find "abc" to be more confusing than it needs to be when trying to debug these tests, because I don't immediately recognize it as a GID lookup.
The list of IDs to cache may be truncated by `$MAX_CACHE_ENTRIES`, so calling `keys %$data` will not be equivalent everywhere. This commit is purely refactoring, but allows consistent access to the truncated IDs in a future commit.
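Roughly, the situation the message describes, with the limit and surrounding code assumed for illustration:

```perl
use strict;
use warnings;

my $MAX_CACHE_ENTRIES = 2;                     # assumed limit
my $data = { 1 => 'a', 2 => 'b', 3 => 'c' };   # id => entity

my @ids = sort keys %$data;
if (@ids > $MAX_CACHE_ENTRIES) {
    @ids = @ids[0 .. $MAX_CACHE_ENTRIES - 1];
}

# From here on, @ids is the authoritative (possibly truncated) list of
# cached IDs; re-deriving it with `keys %$data` would also include IDs
# that were never actually cached.
```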
If we build multiple contexts, I would expect them to use separate connectors without having to build those manually. This is needed in an upcoming commit which adds a `GIDEntityCache` test making use of multiple connectors.
(Refactoring only.) The code being replaced here is basically duplicating what `cache_aware_c` already provides.
When using `cache_aware_c` in the tests, we currently don't clear the Redis test database between test cases (similar to how we rollback `$c->sql`). We also don't appear to be initializing the Redis store to use the `test_database` number. This commit adds a separate configuration option for the minimum Redis test database number to DBDefs.pm, so that we can clear the correct database number for both the cache and the store after each test case.
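As a rough sketch of the cleanup this makes possible; the server address and the assumption that the cache and store use adjacent database numbers are mine:

```perl
use strict;
use warnings;
use Redis;
use DBDefs;

sub clear_redis_test_databases {
    my $redis  = Redis->new(server => '127.0.0.1:6379');
    my $min_db = DBDefs->REDIS_TEST_DATABASE;

    # Flush both the cache and the store test databases after each
    # test case, mirroring how $c->sql is rolled back.
    for my $db ($min_db, $min_db + 1) {
        $redis->select($db);
        $redis->flushdb;
    }
}
```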
This is intended to help resolve deadlocks (MBS-11345, MBS-12314) and performance issues due to lock contention. It supplants metabrainz#212. The new implementation is documented in `Data::Role::EntityCache`.
If you call `get_by_gid` with a redirected gid, we'll cache the resolved gid of the entity, but not the redirected one that was passed in. This fixes that.
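A toy sketch of the behavior change, with plain hashes standing in for the real cache and database:

```perl
use strict;
use warnings;

my %cache;   # gid => entity id, standing in for the Redis cache

# Pretend database: any gid (canonical or redirected) => [canonical gid, id]
my %db = (
    'canon-gid'    => ['canon-gid', 42],
    'redirect-gid' => ['canon-gid', 42],
);

sub get_by_gid {
    my ($gid) = @_;
    return $cache{$gid} if exists $cache{$gid};

    my $row = $db{$gid} or return undef;
    my ($canonical_gid, $id) = @$row;

    # Cache under the canonical gid as before, and *also* under the
    # redirected gid that was passed in, so the next lookup with the
    # old gid is a cache hit too.
    $cache{$canonical_gid} = $id;
    $cache{$gid} = $id if $gid ne $canonical_gid;
    return $id;
}

get_by_gid('redirect-gid');
print exists $cache{'redirect-gid'} ? "cached\n" : "not cached\n";   # cached
```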
Initialize the Redis cache/store in the normal `c` instead of requiring access to a separate `cache_aware_c` in the tests. Remove the mock caches, and just use the Redis caches, which is a better test of real code; in fact it revealed at least one bug where we don't invalidate releases from the cache when their ASINs are updated (fixed in the prior commit). I didn't see any point to the `EntityCache` and `GIDEntityCache` tests having to check how many times certain mock methods were called. It had no bearing on what ended up being stored in the cache, which was all that mattered.
(Force-pushed from 0293336 to f4d3a5a.)
Reviewed replies, pushed changes and commit messages, all good to go! 🚢
Problem
The current implementation for MBS-7241 causes deadlocks and performance issues due to lock contention (MBS-11345, MBS-12314).
Solution
Implement a different solution that blocks cache additions using our Redis store to manage "cachelock" keys.
The new implementation is documented in `Data::Role::EntityCache`. Please review that to make sure it seems sound.
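As a rough illustration of the cachelock idea only (this is not the actual `Data::Role::EntityCache` code; the key names, TTL, and exact ordering are assumptions, and one connection stands in for the separate cache and store):

```perl
use strict;
use warnings;
use Redis;

my $store = Redis->new(server => '127.0.0.1:6379');   # KeyDB store

# Writer side: before committing an update or delete of artist 1, set
# a short-lived cachelock key and drop any cached copy.
$store->setex('cachelock:artist:1', 60, 1);   # TTL as a safety net
$store->del('artist:1');
# ... UPDATE/DELETE artist 1 in PostgreSQL and COMMIT here ...

# Reader side: after loading artist 1 from the database, only add it
# to the cache if no cachelock key exists for it.
unless ($store->exists('cachelock:artist:1')) {
    $store->set('artist:1', 'serialized artist data');
}
```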
Testing

The existing test for MBS-7241 in `GIDEntityCache` passes. However, this hasn't been tested at scale. Complicating things, it also needs to be deployed to production and beta at the same time.