fix: fix bug with concurrent writes to global cache#705
Conversation
| pass | ||
|
|
||
|
|
||
| def test_global_cache_concurrent_writes_692(in_context): |
There was a problem hiding this comment.
I don't really like this test. I thought in the course of doing this bug fix I'd try to lay some groundwork for #691, with the goal being a pattern that can be used test various concurrency scenarios. While I did get this to work, I'm not crazy about it because:
- It wound up being very "white box", requiring knowledge of what tasklets would be called by
global_lock_for_writeandglobal_lock_for_unwrite. - It's brittle. It's almost just dumb luck that I could write a test that both reproduced the problem before the fix and verified the fix. The fix could have wound up changing what these functions do internally enough that this wouldn't have been possible, given how much they depend on the internals.
- It's difficult to reason about. See how much verbiage goes into just trying to explain what's going on to someone reading the test. We generally don't want tests to require this level of explication. I also spent a lot more time writing the test, than I did implementing the fix.
I've left the test here for now, so that it can maybe be used as a jumping off point for inventing a better pattern. I'm open to suggestions.
There was a problem hiding this comment.
Rather than fighting time, why not take the approach took in my last commit on #667? Just make cache calls to represent scenarios you care about, being careful that you use different cache clients for different logical clients.
Speaking of, I'm not sure _InProcessGlobalCache is sufficient here. It seems that different cache clients can stomp on each-other's watches.
There was a problem hiding this comment.
Rather than fighting time, why not take the approach took in my last commit on #667? Just make cache calls to represent scenarios you care about, being careful that you use different cache clients for different logical clients.
Because in order to expose the bug, you need a specific timing of things happening in parallel calls to global_lock_for_write and global_unlock_for_write, and you can't reproduce that by just calling them serially. I think I did manage to find a combination of calls inside a tasklet that reproduced the problem, but my worry is that it wouldn't be evident to anyone reading the code why it reproduced the problem.
After having done the fully deterministic approach, though, maybe I should have stopped there.
Speaking of, I'm not sure
_InProcessGlobalCacheis sufficient here. It seems that different cache clients can stomp on each-other's watches.
I managed to get the same exception as in the original issue using the diagnosis I had made looking at a live instance using Redis.
There was a problem hiding this comment.
Because in order to expose the bug, you need a specific timing of things happening in parallel calls to
global_lock_for_writeandglobal_unlock_for_write, and you can't reproduce that by just calling them serially.
True
Speaking of, I'm not sure
_InProcessGlobalCacheis sufficient here. It seems that different cache clients can stomp on each-other's watches.I managed to get the same exception as in the original issue using the diagnosis I had made looking at a live instance using Redis.
Sure, it just seems that the watch management in _InProcessGlobalCache is broken and likely to cause problems.
There was a problem hiding this comment.
How is it broken? It's supposed to be a decent reference implementation. At any rate, calling out to Redis or Memcached would move this over into the system test realm. I'm still not, overall, happy with where this test is, so I'll be thinking about how to make this better under the auspices of #691 Thank you for your insights. As always, they are appreciated.
There was a problem hiding this comment.
How is it broken?
The watches don't keep track of clients and all clients share the same _watch_keys dict. So if I watch foo and then you watch foo, my watch will be overwritten by yours.
jimfulton
left a comment
There was a problem hiding this comment.
IIUC the gist of this is to avoid deleting the locked key when the last write lock is removed. The rest (quite a bit) is just tolerating the value 00.
I think this can be simplified by writing an empty string in that case, rather than '00', and changing some is None checks to truthiness checks.
Let _update_key handle the distinction between '' and None. (Adding the optimization of passing in an initial old value to avoid a get that was just done.)
| if prev_value == _LOCKED_FOR_WRITE: | ||
| yield global_watch(key, prev_value) | ||
| lock_acquired = yield global_compare_and_swap(key, lock, expires=_LOCK_TIME) | ||
| else: | ||
| lock_acquired = yield global_set_if_not_exists(key, lock, expires=_LOCK_TIME) | ||
|
|
There was a problem hiding this comment.
I wonder if this can be simplified by expanding update-value a tad to take an initial old value and then using that here.
There was a problem hiding this comment.
If you're referring to _update_key, that gets called from the lock/unlock methods for write locks, which are called from _datastore_api.put, which doesn't do a global_get, so we haven't already gotten the value yet when that gets called.
There was a problem hiding this comment.
Sorry, yes _update_key. _update_key would have to change to accept an initial value. While this is only a few lines of code, there is a lot of context to keep track of. Not repeating it, and adding some comments explaining why we use a watch/CAS in one case and set-if-not-empty in the other would be beneficial, IMO.
This is just a suggestion. Feel free to ignore. :)
There was a problem hiding this comment.
Oh, I see. I was confused because we aren't even using that here, but you're suggesting that may be we should...
There was a problem hiding this comment.
After looking at it for a minute, I think I'm going to leave it alone.
| pass | ||
|
|
||
|
|
||
| def test_global_cache_concurrent_writes_692(in_context): |
There was a problem hiding this comment.
Rather than fighting time, why not take the approach took in my last commit on #667? Just make cache calls to represent scenarios you care about, being careful that you use different cache clients for different logical clients.
Speaking of, I'm not sure _InProcessGlobalCache is sufficient here. It seems that different cache clients can stomp on each-other's watches.
I've implemented this in bdec054 . I'm not sure that it's vastly simplified. If you prefer it this way, we can keep it, though. |
I wouldn't say vastly, but I think it's simpler. I like it. :) |
Fixes #692