Skip to content
This repository was archived by the owner on May 6, 2026. It is now read-only.

fix: fix bug with concurrent writes to global cache#705

Merged
chrisrossi merged 3 commits into
googleapis:masterfrom
chrisrossi:fix-692
Aug 11, 2021
Merged

fix: fix bug with concurrent writes to global cache#705
chrisrossi merged 3 commits into
googleapis:masterfrom
chrisrossi:fix-692

Conversation

@chrisrossi
Copy link
Copy Markdown
Contributor

Fixes #692

@chrisrossi chrisrossi requested a review from jimfulton August 5, 2021 19:09
@chrisrossi chrisrossi requested a review from andrewsg as a code owner August 5, 2021 19:09
@chrisrossi chrisrossi requested review from a team August 5, 2021 19:09
@google-cla google-cla Bot added the cla: yes This human has signed the Contributor License Agreement. label Aug 5, 2021
@product-auto-label product-auto-label Bot added the api: datastore Issues related to the googleapis/python-ndb API. label Aug 5, 2021
pass


def test_global_cache_concurrent_writes_692(in_context):
Copy link
Copy Markdown
Contributor Author

@chrisrossi chrisrossi Aug 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really like this test. I thought in the course of doing this bug fix I'd try to lay some groundwork for #691, with the goal being a pattern that can be used test various concurrency scenarios. While I did get this to work, I'm not crazy about it because:

  • It wound up being very "white box", requiring knowledge of what tasklets would be called by global_lock_for_write and global_lock_for_unwrite.
  • It's brittle. It's almost just dumb luck that I could write a test that both reproduced the problem before the fix and verified the fix. The fix could have wound up changing what these functions do internally enough that this wouldn't have been possible, given how much they depend on the internals.
  • It's difficult to reason about. See how much verbiage goes into just trying to explain what's going on to someone reading the test. We generally don't want tests to require this level of explication. I also spent a lot more time writing the test, than I did implementing the fix.

I've left the test here for now, so that it can maybe be used as a jumping off point for inventing a better pattern. I'm open to suggestions.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than fighting time, why not take the approach took in my last commit on #667? Just make cache calls to represent scenarios you care about, being careful that you use different cache clients for different logical clients.

Speaking of, I'm not sure _InProcessGlobalCache is sufficient here. It seems that different cache clients can stomp on each-other's watches.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than fighting time, why not take the approach took in my last commit on #667? Just make cache calls to represent scenarios you care about, being careful that you use different cache clients for different logical clients.

Because in order to expose the bug, you need a specific timing of things happening in parallel calls to global_lock_for_write and global_unlock_for_write, and you can't reproduce that by just calling them serially. I think I did manage to find a combination of calls inside a tasklet that reproduced the problem, but my worry is that it wouldn't be evident to anyone reading the code why it reproduced the problem.

After having done the fully deterministic approach, though, maybe I should have stopped there.

Speaking of, I'm not sure _InProcessGlobalCache is sufficient here. It seems that different cache clients can stomp on each-other's watches.

I managed to get the same exception as in the original issue using the diagnosis I had made looking at a live instance using Redis.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because in order to expose the bug, you need a specific timing of things happening in parallel calls to global_lock_for_write and global_unlock_for_write, and you can't reproduce that by just calling them serially.

True

Speaking of, I'm not sure _InProcessGlobalCache is sufficient here. It seems that different cache clients can stomp on each-other's watches.

I managed to get the same exception as in the original issue using the diagnosis I had made looking at a live instance using Redis.

Sure, it just seems that the watch management in _InProcessGlobalCache is broken and likely to cause problems.

Copy link
Copy Markdown
Contributor Author

@chrisrossi chrisrossi Aug 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is it broken? It's supposed to be a decent reference implementation. At any rate, calling out to Redis or Memcached would move this over into the system test realm. I'm still not, overall, happy with where this test is, so I'll be thinking about how to make this better under the auspices of #691 Thank you for your insights. As always, they are appreciated.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is it broken?

The watches don't keep track of clients and all clients share the same _watch_keys dict. So if I watch foo and then you watch foo, my watch will be overwritten by yours.

Copy link
Copy Markdown

@jimfulton jimfulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC the gist of this is to avoid deleting the locked key when the last write lock is removed. The rest (quite a bit) is just tolerating the value 00.

I think this can be simplified by writing an empty string in that case, rather than '00', and changing some is None checks to truthiness checks.

Let _update_key handle the distinction between '' and None. (Adding the optimization of passing in an initial old value to avoid a get that was just done.)

Comment thread google/cloud/ndb/_cache.py
Comment thread google/cloud/ndb/_cache.py Outdated
Comment on lines +605 to +610
if prev_value == _LOCKED_FOR_WRITE:
yield global_watch(key, prev_value)
lock_acquired = yield global_compare_and_swap(key, lock, expires=_LOCK_TIME)
else:
lock_acquired = yield global_set_if_not_exists(key, lock, expires=_LOCK_TIME)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this can be simplified by expanding update-value a tad to take an initial old value and then using that here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're referring to _update_key, that gets called from the lock/unlock methods for write locks, which are called from _datastore_api.put, which doesn't do a global_get, so we haven't already gotten the value yet when that gets called.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, yes _update_key. _update_key would have to change to accept an initial value. While this is only a few lines of code, there is a lot of context to keep track of. Not repeating it, and adding some comments explaining why we use a watch/CAS in one case and set-if-not-empty in the other would be beneficial, IMO.

This is just a suggestion. Feel free to ignore. :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. I was confused because we aren't even using that here, but you're suggesting that may be we should...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking at it for a minute, I think I'm going to leave it alone.

pass


def test_global_cache_concurrent_writes_692(in_context):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than fighting time, why not take the approach took in my last commit on #667? Just make cache calls to represent scenarios you care about, being careful that you use different cache clients for different logical clients.

Speaking of, I'm not sure _InProcessGlobalCache is sufficient here. It seems that different cache clients can stomp on each-other's watches.

@chrisrossi
Copy link
Copy Markdown
Contributor Author

IIUC the gist of this is to avoid deleting the locked key when the last write lock is removed. The rest (quite a bit) is just tolerating the value 00.

I think this can be simplified by writing an empty string in that case, rather than '00', and changing some is None checks to truthiness checks.

Let _update_key handle the distinction between '' and None. (Adding the optimization of passing in an initial old value to avoid a get that was just done.)

I've implemented this in bdec054 . I'm not sure that it's vastly simplified. If you prefer it this way, we can keep it, though.

@jimfulton
Copy link
Copy Markdown

I'm not sure that it's vastly simplified. If you prefer it this way, we can keep it, though.

I wouldn't say vastly, but I think it's simpler. I like it. :)

@chrisrossi chrisrossi merged commit bb7cadc into googleapis:master Aug 11, 2021
@chrisrossi chrisrossi deleted the fix-692 branch August 11, 2021 15:00
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

api: datastore Issues related to the googleapis/python-ndb API. cla: yes This human has signed the Contributor License Agreement.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

AttributeError: 'NoneType' object has no attribute 'replace'

2 participants