Scope: Fix TOCTOU race in prefix initialization #43

Closed
djmetzle wants to merge 4 commits into master from scope-race-condition-fix

Conversation


@djmetzle djmetzle commented Apr 7, 2026

Backend::getAndSet() has a Time-of-Check-Time-of-Use (TOCTOU) race condition: it performs a non-atomic get() followed by set(), so concurrent callers that both observe a miss will each compute and write different values. The last writer wins, silently orphaning any data written by earlier callers under the first value.
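The orphaning can be simulated with a plain array standing in for the backend (illustrative only; the `scope`/`prefixA` key names are invented, not the real Backend API):

```php
<?php
// Simulation of the race on master: two callers both miss, both
// compute a prefix, and the last set() wins.
$cache = [];

// A and B both get() the scope key concurrently and both miss:
$missA = !array_key_exists('scope', $cache);
$missB = !array_key_exists('scope', $cache);

// Each computes its own prefix and set()s it; last writer wins:
$cache['scope'] = 'prefixA';          // A's set()
$cache['prefixA:user:1'] = 'dataA';   // A caches data under its prefix
$cache['scope'] = 'prefixB';          // B's set() overwrites A's prefix

// All later readers resolve the scope to 'prefixB', so the entry under
// 'prefixA:...' is orphaned: never read again, only evicted.
echo $cache['scope'], "\n"; // prefixB
```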

Add a regression test documenting the race and the new expected behavior under concurrent writes.

We can use add() instead of set() when setting backend values so the first writer wins. If add() fails, re-get() to pick up the winner's value. The $reset path (deleteScope) still uses set() for intentional overwrites.

Note that this addresses the race for both the Scope prefix and getAndSet.
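As a concrete illustration of the described flow (a minimal sketch, not the actual diff; `ArrayBackend` is a hypothetical in-memory stand-in for the real Backend):

```php
<?php
// ArrayBackend: get() returns false on a miss, and add() succeeds
// only when the key is absent, so the first writer wins.
class ArrayBackend {
   private $data = [];
   public function get($key) {
      return array_key_exists($key, $this->data) ? $this->data[$key] : false;
   }
   public function set($key, $value, $expiration = 0) {
      $this->data[$key] = $value;
      return true;
   }
   public function add($key, $value, $expiration = 0) {
      if (array_key_exists($key, $this->data)) {
         return false; // a first writer already won
      }
      $this->data[$key] = $value;
      return true;
   }
}

// getAndSet() per the PR description: add() so the first writer wins;
// on a lost race, re-get() to adopt the winner's value, falling back
// to the computed value if that get() misses.
function getAndSet(ArrayBackend $backend, $key, callable $callback,
                   $expiration = 0, $reset = false) {
   $value = $backend->get($key);
   if ($value !== false && !$reset) {
      return $value;
   }
   $value = $callback();
   if ($reset) {
      $backend->set($key, $value, $expiration); // intentional overwrite
   } else if (!$backend->add($key, $value, $expiration)) {
      $winner = $backend->get($key);
      if ($winner !== false) {
         $value = $winner;
      }
   }
   return $value;
}

// Simulated race: both callers missed and computed before either wrote.
$backend = new ArrayBackend();
$vA = 'V1';
$vB = 'V2';
$backend->add('k', $vA);        // A wins
if (!$backend->add('k', $vB)) { // B loses...
   $vB = $backend->get('k');    // ...and adopts A's value
}
// Both callers now hold 'V1'; nothing was silently orphaned.
```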

Ref:

CC @sctice-ifixit @danielbeardsley

djmetzle added 2 commits April 7, 2026 13:22
Use add() instead of set() when initializing the scope prefix so the
first writer wins. If add() fails, re-get() to pick up the winner's
value. The $reset path (deleteScope) still uses set() for intentional
overwrites.
A get() then set() allows a slow callback's result to be overwritten
by a concurrent caller. Use add() so the first writer wins. Try to fetch
the first writer's value, if possible, but fall back to the computed value
if the get() returns a miss.
@djmetzle djmetzle added the bug label Apr 7, 2026
If we cannot `add`, we try to fetch the first writer's value. If that
fails, make sure to always return a scope prefix. This is the same
failure mode that was fixed for `getAndSet`, which also always needs to
return a value.

Note: we can clean up the getAndSet test to be a bit more intent-revealing, using the same pattern found for the prefix test.
-      $this->set($key, $value, $expiration);
+      if ($reset) {
+         $this->set($key, $value, $expiration);
+      } else if (!$this->add($key, $value, $expiration)) {
Member

@danielbeardsley danielbeardsley Apr 7, 2026


This does as claimed, but I feel like this could cause problems with some usage patterns:

  • DCG (where we short-circuit all GETs and return MISS)
    • This change would fail to update the cache
  • McRouter: how does it handle add() when one instance has a value and the other doesn't?

Contributor Author


That seems like a usage error. DCG doesn't use the reset option then?

Contributor Author


I see, DCG is a backend.

Seems like we'd want to sub away set and add there, as we do in the new tests here.

Contributor Author


Wow, that's confusing. No, I'm incorrect. DCG is intended to repopulate the cache.

Contributor Author


Help me understand the failure mode you're describing? I'm reviewing DCG, and it seems like it would continue to work as expected.

Contributor Author


I don't see any problems with DCG or Mcrouter. This should help fix the race condition in both, and behavior is otherwise preserved.

Member


> Help me understand the failure mode you're describing? I'm reviewing DCG, and it seems like it would continue to work as expected.

I think the scenario is:

  • Prior to DCG request, getAndSet(K, () => V1) sets K ⇒ V1
  • On DCG request, we try to getAndSet(K, () => V2):
    • get(K) => MISS because of the DCG backend wrap on get
    • Because of MISS, we run $value = $callback() and get V2
    • Not $reset, so we try to add(K, V2, TTL)
    • But K is in the cache, so add fails
    • So we get(K) => V1 and return V1

We expected to write V2 to the cache and return it in the DCG request (simulating the cache actually starting empty), but instead we wrote nothing and got back V1.
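The scenario above can be reproduced with invented stand-ins (a plain in-memory `Store`, plus a `MissOnGet` wrapper modeling the DCG wrap that short-circuits every get() to a MISS; these classes are hypothetical, not the real DCG backend):

```php
<?php
// Store: minimal in-memory backend with get/set/add semantics.
class Store {
   private $data = [];
   public function get($key) {
      return array_key_exists($key, $this->data) ? $this->data[$key] : false;
   }
   public function set($key, $value) { $this->data[$key] = $value; return true; }
   public function add($key, $value) {
      if (array_key_exists($key, $this->data)) { return false; }
      $this->data[$key] = $value;
      return true;
   }
}

// MissOnGet: forces every get() to miss, as described for DCG requests.
class MissOnGet {
   private $inner;
   public function __construct(Store $inner) { $this->inner = $inner; }
   public function get($key) { return false; }  // forced MISS
   public function add($key, $value) { return $this->inner->add($key, $value); }
}

$store = new Store();
$store->set('K', 'V1');        // populated prior to the DCG request

$dcg = new MissOnGet($store);
$value = $dcg->get('K');       // MISS because of the wrap
if ($value === false) {
   $value = 'V2';              // $callback() recomputes
   if (!$dcg->add('K', $value)) {  // fails: K already holds V1
      $winner = $store->get('K');  // the re-get() in the trace sees V1
      if ($winner !== false) { $value = $winner; }
   }
}
// $value is 'V1' and the store still holds 'V1': V2 was never written,
// defeating the repopulation the DCG request was meant to perform.
```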

Contributor Author


Right! I see the concern. I missed that add() will see the current value.

Doesn't that mean we need DCG to explicitly reset?


Contributor Author


Found this too:

/**
 * Override the `set` method. Use `add` which is synchronous to detect
 * `set` over-top of existing keys. Delete and reset them to
 * enforce consistency.
 */
public function set($key, $value, $expiration = 0) {
   $addReturn = $this->memcached->add($key, $value, $expiration);
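The quoted override is truncated; the general shape of the "set via add" pattern it describes might look like this sketch (an invented in-memory `FakeMemcached` and a free function, not the real continuation of the method):

```php
<?php
// Sketch of "set via add": try add() first; if the key already exists,
// delete and re-add so the new value lands and the over-top set is
// detectable. FakeMemcached is an invented in-memory stand-in.
class FakeMemcached {
   private $data = [];
   public function get($key) {
      return array_key_exists($key, $this->data) ? $this->data[$key] : false;
   }
   public function add($key, $value, $expiration = 0) {
      if (array_key_exists($key, $this->data)) { return false; }
      $this->data[$key] = $value;
      return true;
   }
   public function delete($key) { unset($this->data[$key]); return true; }
}

function setViaAdd(FakeMemcached $mc, $key, $value, $expiration = 0) {
   if ($mc->add($key, $value, $expiration)) {
      return true; // key was absent: a plain add suffices
   }
   // Key already present: this is a set over-top of an existing key.
   // Delete and re-add to enforce consistency.
   $mc->delete($key);
   return $mc->add($key, $value, $expiration);
}

$mc = new FakeMemcached();
setViaAdd($mc, 'k', 'old');
setViaAdd($mc, 'k', 'new');
// $mc->get('k') is now 'new'
```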


djmetzle commented Apr 7, 2026

Some things Claude flagged reviewing this:

  1. Stats profile shifts. The Stats wrapper tracks set_count and add_count separately. Before this change, every getAndSet miss showed up as a set. Now it shows up as an add (and a get on race-loss). If anyone is monitoring set_count or add_count specifically, the numbers will change. Your Cache::getMemcacheStats() aggregates these so it probably doesn't matter, but worth noting.

  2. getAndSetMultiple still uses setMultiple(). Same TOCTOU problem, but there's no addMultiple() primitive in the Backend interface, so we can't fix it the same way. Not a regression — it was already racy — but it's now inconsistent with getAndSet.
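Absent an addMultiple() primitive, one conceivable (untested) workaround would be a per-key add() loop, sketched here against a hypothetical backend. It trades the multi-set's single round trip for one call per key, which is why it isn't obviously worth doing:

```php
<?php
// Hypothetical per-key fallback for the missing addMultiple(): loop
// add() per key and collect the keys that lost the race so the caller
// can re-get() them. MiniBackend is an invented in-memory stand-in.
class MiniBackend {
   private $data = [];
   public function add($key, $value) {
      if (array_key_exists($key, $this->data)) { return false; }
      $this->data[$key] = $value;
      return true;
   }
   public function get($key) {
      return array_key_exists($key, $this->data) ? $this->data[$key] : false;
   }
}

function addMultiple(MiniBackend $backend, array $values) {
   $lost = [];
   foreach ($values as $key => $value) {
      if (!$backend->add($key, $value)) {
         $lost[] = $key; // a concurrent writer got here first
      }
   }
   return $lost;
}

$backend = new MiniBackend();
$backend->add('a', 1);                               // simulate a prior writer
$lost = addMultiple($backend, ['a' => 99, 'b' => 2]);
// $lost contains 'a'; 'b' was written fresh.
```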

@danielbeardsley
Member

> The last writer wins, silently orphaning data written by earlier callers under the first value.

This seems to be the behavior this pull is trying to alter. I'm tempted to say we should narrow the focus here and just make this change for Scopes. I feel like that would reduce the chance of breaking current behavior and address the problem that the issue talks about.

Outside of scopes, this doesn't seem like a big deal to me: the later SET wins. But I see how it could play poorly with Scopes, where we are storing a random prefix, not a cached version of some external source of truth.


djmetzle commented Apr 7, 2026

Is the trouble though that Scope relies on getAndSet's behavior? We need to fix both to fully address the race:

public function getScopePrefix(bool $reset = false) {
   if ($this->scopePrefix === null || $reset) {
      $scopeValue = $this->backend->getAndSet($this->getScopeKey(),
         function() {
            return substr(md5(microtime() . $this->scopeName), 0, 16);


djmetzle commented Apr 7, 2026

Actually the getAndSet fix also addresses the scope problem?

With the race condition also addressed in `getAndSet`, we can now safely
rely on it for concurrent scope initialization. Revert back to the
original version.
@sterlinghirsh
Member

Responding to the original issue here:

> The scope prefix is also cached in $this->scopePrefix (an instance variable), so within a single request the stale prefix persists even after it's been overwritten in the backend, causing all subsequent reads to miss for the rest of that request.

In the case you're describing, two simultaneous requests both hit a cold cache. Workers A and B both generate scope keys. But any time you generate a scope key, all subsequent requests would be misses for the rest of that request anyway since that request is responsible for populating that scope.

> First writer wins. After add(), re-get() to learn the winning value.

In what situation do you want first writer to win? I'd think the later writer would have the more up to date information if it came down to that.

> silently orphaning data written by earlier callers under the first value.

So the bug here is wasted writes? What is the consequence of that? Are we getting untimely cache evictions?

> Is getAndSet intended to be atomic? The docstring doesn't say, but the name and usage pattern (lazy cache population) strongly implies it.

It's impossible for this to be atomic. The whole point is that you get first, and then if you need to revalidate then you compute the value, and then you set. Computing the value is the expensive part, and this does nothing to mitigate overlapping revalidations, but that wasn't the intent of getAndSet.

Let's say you start 2 processes computing a few keys of scoped data under your proposal.

A: GET scope key - MISS - start generating scope prefix A
B: GET scope key - MISS - start generating scope prefix B
A: ADD scope key - SUCCESS - GET value key A - MISS - start generating value A
B: ADD scope key - FAIL - GET scope key again - HIT - GET value key A - MISS - start generating value B
A: ADD value key A - SUCCESS - RETURN value A
B: ADD value key A - FAIL - GET value key A again - HIT - RETURN value A

vs on master:

A: GET scope key - MISS - start generating scope prefix A
B: GET scope key - MISS - start generating scope prefix B
A: SET scope key - GET value key A - MISS - start generating value A
B: SET scope key - GET value key B - MISS - start generating value B
A: SET value key A - RETURN value A
B: SET value key B - RETURN value B

This seems like it will increase our cache traffic, and I'm not sure what the advantage is of having the first writer win over the last writer if both are doing all the computation anyway. It seems like this is intended to solve a bug, but I'm curious what the actual behavior leading to this was.

In a perfect world, maybe there would be a way to do a getOrLock: either getting the value, or telling the cache you intend to place a value in that key. Then subsequent requests for that key are held (e.g. a blocking network request) until the value is set by the original process, so that the result can be instantly distributed to everyone waiting for it. You can do something like this in MySQL. I think there is a way to do this with Redis too, but I'm not sure. Doubt it for apcu / memcache.

One problem is what happens if the first process never comes back. With MySQL, it can end the transaction eventually and let the next client awaiting a lock have one. In memcache / apcu you probably just have to have a timeout, at which point the process starts revalidating anyway.
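The getOrLock idea can be approximated on an add()-capable backend by claiming a sentinel lock key, with losers polling until the winner publishes; a rough sketch with invented names (real blocking hand-off isn't available in memcache/apcu, so this polls with a timeout as described above):

```php
<?php
// Rough sketch: the lock is a sentinel key claimed with add(); losers
// poll until the winner publishes the value or the poll budget runs
// out, at which point they recompute anyway. LockDemoBackend and
// getOrCompute are invented for illustration.
class LockDemoBackend {
   private $data = [];
   public function get($key) {
      return array_key_exists($key, $this->data) ? $this->data[$key] : false;
   }
   public function set($key, $value) { $this->data[$key] = $value; return true; }
   public function add($key, $value) {
      if (array_key_exists($key, $this->data)) { return false; }
      $this->data[$key] = $value;
      return true;
   }
   public function delete($key) { unset($this->data[$key]); return true; }
}

function getOrCompute(LockDemoBackend $backend, $key, callable $callback,
                      $maxPolls = 50) {
   $value = $backend->get($key);
   if ($value !== false) { return $value; }

   if ($backend->add("lock:$key", 1)) {
      // We hold the lock: compute, publish, release.
      $value = $callback();
      $backend->set($key, $value);
      $backend->delete("lock:$key");
      return $value;
   }
   // Someone else is computing: poll for the published value.
   for ($i = 0; $i < $maxPolls; $i++) {
      usleep(1000);
      $value = $backend->get($key);
      if ($value !== false) { return $value; }
   }
   return $callback(); // lock holder never came back: recompute anyway
}
```

Only the first caller runs the callback; later callers either read the published value or, after the timeout, fall back to recomputing, which mirrors the "first process never comes back" failure mode.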


djmetzle commented Apr 9, 2026

These seem like questions for the issue/spec @sterlinghirsh. Did you see the other PR? This version should probably be closed.
