Skip to content

Conversation

@ejona86
Copy link
Member

@ejona86 ejona86 commented Mar 6, 2025

Listener2.onResult() doesn't require running in the sync context, so when called from the sync context it is guaranteed not to do its processing immediately (instead, it schedules work into the sync context).

The code was doing an update dance: 1) update service config to add new cluster, 2) update config selector to use new cluster, 3) update service config to remove old clusters. But the onResult() wasn't being processed immediately, so the actual execution order was 2, 1, 3 which has a small window where RPCs will fail. But onResult2() does run immediately. And since ca4819a, updateBalancingState() updates the picker immediately.

cleanUpRoutes() was also racy because it updated the routingConfig before swapping to the new config selector, so RPCs could fail saying there was no route instead of the useful error message. Even with the opposite order, some RPCs may be executing the while loop of selectConfig(), trying to acquire a cluster. The code unreffed the clusters before updating the routingConfig, so those RPCs could go into a tight loop until the routingConfig was updated. Also, once the routingConfig was updated to EMPTY those RPCs would similarly see the wrong error message. To give the correct error message, selectConfig() must fail such RPCs directly, and once it can do that there's no need to stop using the config selector in error cases. This has the benefit of fewer moving parts and more consistent threading among cases.

The added test was able to detect the race 2% of the time. The slower the code/machine, the more reliable the test failed. ca4819a along with this commit reduced it to 0 failures in 1000 runs.

Discovered when investigating b/394850611


This is the followup of #11945 I mentioned.

CC @kannanjgithub, @danielzhaotongliu

Listener2.onResult() doesn't require running in the sync context, so
when called from the sync context it is guaranteed not to do its
processing immediately (instead, it schedules work into the sync
context).

The code was doing an update dance: 1) update service config to add new
cluster, 2) update config selector to use new cluster, 3) update service
config to remove old clusters. But the onResult() wasn't being processed
immediately, so the actual execution order was 2, 1, 3 which has a small
window where RPCs will fail. But onResult2() does run immediately. And
since ca4819a, updateBalancingState() updates the picker immediately.

cleanUpRoutes() was also racy because it updated the routingConfig
before swapping to the new config selector, so RPCs could fail saying
there was no route instead of the useful error message. Even with the
opposite order, some RPCs may be executing the while loop of
selectConfig(), trying to acquire a cluster. The code unreffed the
clusters before updating the routingConfig, so those RPCs could go into
a tight loop until the routingConfig was updated. Also, once the
routingConfig was updated to EMPTY those RPCs would similarly
see the wrong error message. To give the correct error message,
selectConfig() must fail such RPCs directly, and once it can do that
there's no need to stop using the config selector in error cases. This
has the benefit of fewer moving parts and more consistent threading
among cases.

The added test was able to detect the race 2% of the time. The slower
the code/machine, the more reliable the test failed. ca4819a along
with this commit reduced it to 0 failures in 1000 runs.

Discovered when investigating b/394850611
@ejona86 ejona86 requested a review from larry-safran March 6, 2025 20:42
Copy link
Collaborator

@danielzhaotongliu danielzhaotongliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (don't have permissions to resolve discussions).

@ejona86 ejona86 merged commit d82613a into grpc:master Mar 7, 2025
16 checks passed
@ejona86 ejona86 deleted the xds-cluster-race branch March 7, 2025 18:33
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 6, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants