fix: fix bug with concurrent writes to global cache by chrisrossi · Pull Request #705 · googleapis/python-ndb

chrisrossi · 2021-08-05T19:09:38Z

Fixes #692

chrisrossi · 2021-08-05T19:27:26Z

+        pass
+
+
+def test_global_cache_concurrent_writes_692(in_context):


I don't really like this test. I thought in the course of doing this bug fix I'd try to lay some groundwork for #691, with the goal being a pattern that can be used to test various concurrency scenarios. While I did get this to work, I'm not crazy about it because:

It wound up being very "white box", requiring knowledge of what tasklets would be called by global_lock_for_write and global_unlock_for_write.

It's brittle. It's almost just dumb luck that I could write a test that both reproduced the problem before the fix and verified the fix. The fix could have wound up changing what these functions do internally enough that this wouldn't have been possible, given how much they depend on the internals.

It's difficult to reason about. See how much verbiage goes into just trying to explain what's going on to someone reading the test. We generally don't want tests to require this level of explication. I also spent a lot more time writing the test than I did implementing the fix.

I've left the test here for now, so that it can maybe be used as a jumping off point for inventing a better pattern. I'm open to suggestions.

Rather than fighting time, why not take the approach I took in my last commit on #667? Just make cache calls to represent the scenarios you care about, being careful to use different cache clients for different logical clients.

Speaking of, I'm not sure _InProcessGlobalCache is sufficient here. It seems that different cache clients can stomp on each other's watches.

Rather than fighting time, why not take the approach I took in my last commit on #667? Just make cache calls to represent the scenarios you care about, being careful to use different cache clients for different logical clients.

Because in order to expose the bug, you need a specific timing of things happening in parallel calls to global_lock_for_write and global_unlock_for_write, and you can't reproduce that by just calling them serially. I think I did manage to find a combination of calls inside a tasklet that reproduced the problem, but my worry is that it wouldn't be evident to anyone reading the code why it reproduced the problem.

After having done the fully deterministic approach, though, maybe I should have stopped there.
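As a rough illustration of why serial calls cannot expose a timing-dependent bug, here is a minimal sketch in which plain Python generators stand in for tasklets. The names `append_token`, `run_interleaved`, and `final_value` are hypothetical and are not part of python-ndb; the only point is that the same two operations give different results depending on the order in which their steps are advanced.

```python
def append_token(cache, key, token):
    """Read-modify-write split into two steps so a test can pause in between."""
    current = cache.get(key, b"")
    yield "read"  # pause point: another operation may run here
    cache[key] = current + token
    yield "written"


def run_interleaved(schedule, operations):
    """Advance the named operations one step at a time, in the given order."""
    for name in schedule:
        next(operations[name], None)


def final_value(schedule):
    """Run two appenders, "a" and "b", under the given schedule."""
    cache = {}
    operations = {
        "a": append_token(cache, "k", b"A"),
        "b": append_token(cache, "k", b"B"),
    }
    run_interleaved(schedule, operations)
    return cache["k"]
```

Under the serial schedule `"aabb"` both tokens survive, while under `"abab"` the second write clobbers the first. That is the kind of lost update a purely serial test would never see.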

Speaking of, I'm not sure _InProcessGlobalCache is sufficient here. It seems that different cache clients can stomp on each other's watches.

I managed to get the same exception as in the original issue using the diagnosis I had made looking at a live instance using Redis.

Because in order to expose the bug, you need a specific timing of things happening in parallel calls to global_lock_for_write and global_unlock_for_write, and you can't reproduce that by just calling them serially.

True

Speaking of, I'm not sure _InProcessGlobalCache is sufficient here. It seems that different cache clients can stomp on each other's watches.

I managed to get the same exception as in the original issue using the diagnosis I had made looking at a live instance using Redis.

Sure, it just seems that the watch management in _InProcessGlobalCache is broken and likely to cause problems.

How is it broken? It's supposed to be a decent reference implementation. At any rate, calling out to Redis or Memcached would move this over into the system test realm. I'm still not, overall, happy with where this test is, so I'll be thinking about how to make this better under the auspices of #691. Thank you for your insights. As always, they are appreciated.

How is it broken?

The watches don't keep track of clients and all clients share the same _watch_keys dict. So if I watch foo and then you watch foo, my watch will be overwritten by yours.
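The problem described here can be pictured with a toy cache. This is a sketch under stated assumptions, not the actual `_InProcessGlobalCache` code: `ToyCache`, its `per_client` switch, and `stale_write_succeeds` are hypothetical names invented for the illustration. With one shared watch table, a later watch by client B replaces client A's watch, so A's compare-and-swap is checked against the value B saw rather than the value A saw.

```python
class ToyCache:
    """Toy stand-in for a cache with watch / compare-and-swap (hypothetical)."""

    def __init__(self, per_client):
        self.data = {}
        self.watches = {}
        self.per_client = per_client  # False: one watch table shared by all clients

    def _slot(self, client, key):
        # Shared mode ignores the client, so watches on the same key collide.
        return (client, key) if self.per_client else key

    def set(self, key, value):
        self.data[key] = value

    def watch(self, client, key):
        self.watches[self._slot(client, key)] = self.data.get(key)

    def compare_and_swap(self, client, key, new_value):
        slot = self._slot(client, key)
        if slot not in self.watches:
            return False
        watched = self.watches.pop(slot)
        if watched != self.data.get(key):
            return False  # value changed since the watch was placed
        self.data[key] = new_value
        return True


def stale_write_succeeds(per_client):
    cache = ToyCache(per_client)
    cache.set("foo", b"v1")
    cache.watch("A", "foo")  # A sees v1
    cache.set("foo", b"v2")  # value changes behind A's back
    cache.watch("B", "foo")  # B sees v2; in shared mode this replaces A's watch
    return cache.compare_and_swap("A", "foo", b"from-A")
```

In shared mode `stale_write_succeeds(False)` returns `True`, meaning A's stale write goes through. With watches keyed per client, `stale_write_succeeds(True)` returns `False`, which is what a watch is meant to guarantee.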

jimfulton

IIUC the gist of this is to avoid deleting the locked key when the last write lock is removed. The rest (quite a bit) is just tolerating the value 00.

I think this can be simplified by writing an empty string in that case, rather than '00', and changing some is None checks to truthiness checks.

Let _update_key handle the distinction between '' and None. (Adding the optimization of passing in an initial old value to avoid a get that was just done.)
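A minimal sketch of the idea, assuming a plain dict in place of the global cache: `remove_write_lock`, `has_cached_value`, and the `keep_empty` flag are hypothetical names for illustration only and do not reflect how `_update_key` is written. Leaving an empty value behind keeps the key present, and a truthiness check then treats a missing key (`None`) and an empty value (`b""`) the same way.

```python
def remove_write_lock(cache, key, lock, keep_empty=True):
    """Remove one writer's token from the value stored under ``key``.

    ``cache`` is a plain dict standing in for the global cache.
    """
    remaining = cache.get(key, b"").replace(lock, b"")
    if remaining or keep_empty:
        cache[key] = remaining  # may be b"": key stays present but holds nothing
    else:
        cache.pop(key, None)  # delete-on-last-unlock behaviour, for comparison


def has_cached_value(cache, key):
    """Truthiness check: a missing key (None) and b"" both mean 'nothing usable'."""
    return bool(cache.get(key))
```

For example, starting from `{"k": b".w1.w2"}` and removing `b".w1"` and then `b".w2"` leaves `"k"` in the dict with the value `b""`, while `has_cached_value` reports `False` just as it would for a key that was never set.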

jimfulton · 2021-08-07T15:12:09Z

+    if prev_value == _LOCKED_FOR_WRITE:
+        yield global_watch(key, prev_value)
+        lock_acquired = yield global_compare_and_swap(key, lock, expires=_LOCK_TIME)
+    else:
+        lock_acquired = yield global_set_if_not_exists(key, lock, expires=_LOCK_TIME)
+


I wonder if this can be simplified by expanding update-value a tad to take an initial old value and then using that here.

If you're referring to _update_key, that gets called from the lock/unlock methods for write locks, which are called from _datastore_api.put, which doesn't do a global_get, so we haven't already gotten the value yet when that gets called.

Sorry, yes, _update_key. _update_key would have to change to accept an initial value. While this is only a few lines of code, there is a lot of context to keep track of. Not repeating it, and adding some comments explaining why we use a watch/CAS in one case and set-if-not-exists in the other, would be beneficial, IMO.

This is just a suggestion. Feel free to ignore. :)

Oh, I see. I was confused because we aren't even using that here, but you're suggesting that maybe we should...

After looking at it for a minute, I think I'm going to leave it alone.
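For readers following the thread, one way to read the two branches in the diff is sketched below with a plain dict in place of the cache. This is an interpretation, not the library's code: `try_acquire` and its `placeholder` argument (standing in for `_LOCKED_FOR_WRITE`) are hypothetical. When the key already holds the placeholder, a set-if-not-exists could never succeed because the key exists, so the swap has to be conditional on the value being unchanged; when nothing was there, it is enough to require that nobody has created the key since.

```python
def try_acquire(cache, key, lock, prev_value, placeholder):
    """Try to install ``lock`` under ``key`` in a dict standing in for the cache.

    ``prev_value`` is what the caller last read for ``key`` (None if absent).
    """
    if prev_value == placeholder:
        # Key exists: swap only if it still holds what the caller saw (CAS-like).
        if cache.get(key) == prev_value:
            cache[key] = lock
            return True
        return False
    # Nothing usable was there: succeed only if the key is still absent.
    if key not in cache:
        cache[key] = lock
        return True
    return False
```

In both branches a second caller holding the same stale `prev_value` fails, because the first successful call has already changed what is stored under the key.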

jimfulton · 2021-08-07T15:38:53Z

+        pass
+
+
+def test_global_cache_concurrent_writes_692(in_context):


Rather than fighting time, why not take the approach I took in my last commit on #667? Just make cache calls to represent the scenarios you care about, being careful to use different cache clients for different logical clients.

Speaking of, I'm not sure _InProcessGlobalCache is sufficient here. It seems that different cache clients can stomp on each other's watches.

chrisrossi · 2021-08-10T19:50:56Z

IIUC the gist of this is to avoid deleting the locked key when the last write lock is removed. The rest (quite a bit) is just tolerating the value 00.

I think this can be simplified by writing an empty string in that case, rather than '00', and changing some is None checks to truthiness checks.

Let _update_key handle the distinction between '' and None. (Adding the optimization of passing in an initial old value to avoid a get that was just done.)

I've implemented this in bdec054 . I'm not sure that it's vastly simplified. If you prefer it this way, we can keep it, though.

jimfulton · 2021-08-10T19:54:33Z

I'm not sure that it's vastly simplified. If you prefer it this way, we can keep it, though.

I wouldn't say vastly, but I think it's simpler. I like it. :)

fix: fix bug with concurrent writes to global cache

3de59fc

Fixes googleapis#692

chrisrossi requested a review from jimfulton August 5, 2021 19:09

chrisrossi requested a review from andrewsg as a code owner August 5, 2021 19:09

chrisrossi requested review from a team August 5, 2021 19:09

google-cla Bot added the cla: yes This human has signed the Contributor License Agreement. label Aug 5, 2021

product-auto-label Bot added the api: datastore Issues related to the googleapis/python-ndb API. label Aug 5, 2021

chrisrossi commented Aug 5, 2021


jimfulton reviewed Aug 7, 2021


Chris Rossi added 2 commits August 10, 2021 15:01

Correct docstring.

227d19b

Use empty bytes for empty write lock value.

bdec054

jimfulton approved these changes Aug 10, 2021


chrisrossi merged commit bb7cadc into googleapis:master Aug 11, 2021

chrisrossi deleted the fix-692 branch August 11, 2021 15:00

tseaver mentioned this pull request Oct 18, 2021

TypeError: argument of type 'NoneType' is not iterable #728

Closed
