Scope: Fix TOCTOU race in prefix initialization#43
Conversation
Use add() instead of set() when initializing the scope prefix so the first writer wins. If add() fails, re-get() to pick up the winner's value. The $reset path (deleteScope) still uses set() for intentional overwrites.
A get() then set() allows a slow callback's result to be overwritten by a concurrent caller. Use add() so the first writer wins. Try to fetch the first writers value, if possible, but fall back to the computed value if the get() returns a miss.
If we cannot `add`, we try to fetch the first writers value. If that fails, make sure to always return a scope prefix. This is the same failure mode that was fixed for `getAndSet`, which also always needs to return a value. Note: we can clean up the getAndSet test to be a bit more intent revealing, using the same pattern found for the prefix test.
| $this->set($key, $value, $expiration); | ||
| if ($reset) { | ||
| $this->set($key, $value, $expiration); | ||
| } else if (!$this->add($key, $value, $expiration)) { |
There was a problem hiding this comment.
This does as claimed, but I feel like this could cause problems with some usage patterns:
- DCG (where we short-circuit all GETs and return MISS)
- This change would fail to update the cache
- McRouter: how does it handle
add()when one instance has a value and the other doesn't?
There was a problem hiding this comment.
That seems like a usage error. DCG doesn't use the reset option then?
There was a problem hiding this comment.
I see, DCG is a backend.
Seems like we'd want to sub away set and add there, as we do in the new tests here.
There was a problem hiding this comment.
Wow that's confusing, no, im incorrect. DCG is intended to repopulate the cache.
There was a problem hiding this comment.
Help me understand the failure mode you're describing? I'm reviewing DCG, and it seems like it would continue to work as expected.
There was a problem hiding this comment.
I dont see any problems with DCG or Mcrouter. This should help fix the race condition in both. And behavior is otherwise preserved.
There was a problem hiding this comment.
Help me understand the failure mode you're describing? I'm reviewing DCG, and it seems like it would continue to work as expected.
I think the scenario is:
- Prior to DCG request,
getAndSet(K, () => V1)setsK ⇒ V1 - On DCG request, we try to
getAndSet(K, () => V2):get(K) => MISSbecause of the DCG backend wrap onget- Because of
MISS, we run$value = $callback()and getV2 - Not
$reset, so we try toadd(K, V2, TTL) - But
Kis in the cache, soaddfails - So we
get(K) => V1and returnV1
We expected to write V2 to the cache and return it in the DCG request (simulating the cache actually starting empty), but instead we wrote nothing and got back V1.
There was a problem hiding this comment.
Right! I see the concern. Missed that add will see the current value.
Doesn't that mean we need DCG to explicitly reset?
There was a problem hiding this comment.
There was a problem hiding this comment.
Found this too:
Matryoshka/library/iFixit/Matryoshka/McRouter.php
Lines 21 to 27 in ab81f01
|
Some things Claude flagged reviewing this:
|
This seems to be the behavior this pull is trying alter. I'm tempted to say we should narrow the focus here and just make this change for Scopes. I feel like that would reduce the chance of breaking current behavior and address the problem that the issue talks about. Outside of scopes, this doesn't seem like a big deal to me, the later SET wins. But I see how it could play poorly with Scopes where we are storing a random prefix, not a cached version of some external source of truth. |
|
Is the trouble though that Matryoshka/library/iFixit/Matryoshka/Scope.php Lines 24 to 28 in ab81f01 |
|
Actually the |
With the race condition also addressed in `getAndSet`, we can now safely rely on it for concurrent scope initialization. Revert back to the original version.
|
Responding to the original issue here:
In the case you're describing, two simultaneous requests both hit a cold cache. Workers A and B both generate scope keys. But any time you generate a scope key, all subsequent requests would be misses for the rest of that request anyway since that request is responsible for populating that scope.
In what situation do you want first writer to win? I'd think the later writer would have the more up to date information if it came down to that.
So the bug here is wasted writes? What is the consequence of that? Are we getting untimely cache evictions?
It's impossible for this to be atomic. The whole point is that you get first, and then if you need to revalidate then you compute the value, and then you set. Computing the value is the expensive part, and this does nothing to mitigate overlapping revalidations, but that wasn't the intent of getAndSet. Let's say you start 2 processes computing a few keys of scoped data under your proposal. A: GET scope key - MISS - start generating scope prefix A vs on master: A: GET scope key - MISS - start generating scope prefix A This seems like it will increase our cache traffic and I'm not sure what the advantage of having first writer win over last writer if both are doing all the computation anyway. It seems like this is intended to solve a bug but I'm curious what the actual behavior leading to this was. In a perfect world, maybe there would be a way to do a getOrLock, either getting the value or telling the cache you intend to place a value in that key. Then subsequent requests for that key are held (e.g. blocking network request) until the value is set by the original process so that the result can be instantly distributed to everyone waiting for it. You can do something like this in mysql. I think there is a way to do this with redis too but I'm not sure. Doubt it for apcu / memcache. One problem is what happens if the first process never comes back. With mysql, it can end the transaction eventually and let the next client awaiting a lock have one. In memcache / apcu you probably just have to have a timeout at which point the process starts revalidating anyway. |
|
These seem like questions for the issue/spec @sterlinghirsh. Did you see the other PR? This version should probably be closed. |
Backend::getAndSet() has Time-of-Check-Time-of-Use (TOCTOU) race conditions. It performs non-atomic
get()thenset(), so concurrent callers that both observe a miss will each compute and write different values. The last writer wins, silently orphaning data written by earlier callers under the first value.Add a regression test documenting the race conditions, and the new expected behavior to handle concurrent writes.
We can use
add()instead ofset()when setting backend values so the first writer wins. Ifadd()fails, re-get()to pick up the winner's value. The$resetpath (deleteScope) still usesset()for intentional overwrites.Note this addresses the race for both the Scope prefix, as well as
getAndSet.Ref:
CC @sctice-ifixit @danielbeardsley