xds: Fix cluster selection races when updating config selector #11947

ejona86 · 2025-03-06T20:42:25Z

Listener2.onResult() doesn't require running in the sync context, so when called from the sync context it is guaranteed not to do its processing immediately (instead, it schedules work into the sync context).

The code was doing an update dance: 1) update service config to add new cluster, 2) update config selector to use new cluster, 3) update service config to remove old clusters. But onResult() wasn't being processed immediately, so the actual execution order was 2, 1, 3, which leaves a small window where RPCs will fail. onResult2(), by contrast, does run immediately. And since ca4819a, updateBalancingState() updates the picker immediately.
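To illustrate the reordering, here is a minimal single-threaded sketch. `MiniSyncContext` and `UpdateDanceDemo` are hypothetical names, not gRPC classes; the context only mimics the behavior described above, where work submitted from inside the context is queued and runs after the current task returns. Steps 1 and 3 stand in for the service config updates and step 2 for the config selector swap.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for a synchronization context (single-threaded only).
final class MiniSyncContext {
  private final Queue<Runnable> queue = new ArrayDeque<>();
  private boolean draining;

  void execute(Runnable task) {
    queue.add(task);
    if (draining) {
      return; // Reentrant call: the task runs after the current one returns.
    }
    draining = true;
    try {
      Runnable next;
      while ((next = queue.poll()) != null) {
        next.run();
      }
    } finally {
      draining = false;
    }
  }
}

final class UpdateDanceDemo {
  // Returns the order in which the three steps actually took effect.
  static List<String> run(boolean deferServiceConfigUpdates) {
    List<String> order = new ArrayList<>();
    MiniSyncContext ctx = new MiniSyncContext();
    ctx.execute(() -> {
      apply(ctx, deferServiceConfigUpdates, () -> order.add("1")); // add new cluster
      order.add("2");                                              // swap config selector
      apply(ctx, deferServiceConfigUpdates, () -> order.add("3")); // remove old clusters
    });
    return order;
  }

  private static void apply(MiniSyncContext ctx, boolean defer, Runnable update) {
    if (defer) {
      ctx.execute(update); // Scheduled into the context, like the deferred path.
    } else {
      update.run();        // Applied immediately, like the immediate path.
    }
  }
}
```

With `run(true)` the recorded order is 2, 1, 3, so the selector briefly points at a cluster the service config does not yet contain. With `run(false)` the order is 1, 2, 3.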

cleanUpRoutes() was also racy because it updated the routingConfig before swapping to the new config selector, so RPCs could fail saying there was no route instead of the useful error message. Even with the opposite order, some RPCs may be executing the while loop of selectConfig(), trying to acquire a cluster. The code unreffed the clusters before updating the routingConfig, so those RPCs could go into a tight loop until the routingConfig was updated. Also, once the routingConfig was updated to EMPTY those RPCs would similarly see the wrong error message. To give the correct error message, selectConfig() must fail such RPCs directly, and once it can do that there's no need to stop using the config selector in error cases. This has the benefit of fewer moving parts and more consistent threading among cases.
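A rough sketch of the "fail directly" idea follows. It is not the XdsNameResolver code: `RoutingSnapshot`, `SelectorSketch`, and their methods are hypothetical, and statuses are plain strings for brevity. It assumes the routing snapshot carries an optional error status, that the selection loop checks it before trying to take a cluster reference, and that the error snapshot is published before references are released.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical immutable routing snapshot: either a usable cluster or an error.
final class RoutingSnapshot {
  final String cluster;     // null when errorStatus is set
  final String errorStatus; // null when routing is usable

  private RoutingSnapshot(String cluster, String errorStatus) {
    this.cluster = cluster;
    this.errorStatus = errorStatus;
  }

  static RoutingSnapshot forCluster(String cluster) {
    return new RoutingSnapshot(cluster, null);
  }

  static RoutingSnapshot forError(String errorStatus) {
    return new RoutingSnapshot(null, errorStatus);
  }
}

final class SelectorSketch {
  private final Map<String, AtomicInteger> refs = new ConcurrentHashMap<>();
  private volatile RoutingSnapshot routing;

  SelectorSketch(String initialCluster) {
    refs.put(initialCluster, new AtomicInteger(1)); // the resolver's own reference
    routing = RoutingSnapshot.forCluster(initialCluster);
  }

  // Per-RPC selection: fail directly on an error snapshot instead of spinning.
  String select() {
    while (true) {
      RoutingSnapshot snapshot = routing;
      if (snapshot.errorStatus != null) {
        return "FAILED: " + snapshot.errorStatus;
      }
      if (retain(snapshot.cluster)) {
        return snapshot.cluster;
      }
      // The cluster was released concurrently; re-read the snapshot and retry.
    }
  }

  private boolean retain(String cluster) {
    AtomicInteger count = refs.get(cluster);
    if (count == null) {
      return false;
    }
    while (true) {
      int current = count.get();
      if (current == 0) {
        return false; // Already fully released; must not be resurrected.
      }
      if (count.compareAndSet(current, current + 1)) {
        return true;
      }
    }
  }

  // Publish the error snapshot first, then drop the resolver's references.
  void failWith(String errorStatus) {
    routing = RoutingSnapshot.forError(errorStatus);
    for (AtomicInteger count : refs.values()) {
      count.decrementAndGet();
    }
  }

  int refCount(String cluster) {
    AtomicInteger count = refs.get(cluster);
    return count == null ? 0 : count.get();
  }
}
```

Because the error snapshot is visible before any reference count reaches zero, an RPC that loses the race in `retain()` sees the error on its next loop iteration and returns it, rather than looping until the routing is replaced or reporting a generic "no route" failure.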

The added test was able to detect the race 2% of the time. The slower the code/machine, the more reliably the test failed. ca4819a along with this commit reduced it to 0 failures in 1000 runs.

Discovered when investigating b/394850611

This is the followup of #11945 I mentioned.

CC @kannanjgithub, @danielzhaotongliu

xds/src/test/java/io/grpc/xds/FakeControlPlaneXdsIntegrationTest.java

xds/src/main/java/io/grpc/xds/XdsNameResolver.java

danielzhaotongliu

LGTM (don't have permissions to resolve discussions).

ejona86 requested a review from larry-safran March 6, 2025 20:42

danielzhaotongliu reviewed Mar 6, 2025

larry-safran approved these changes Mar 6, 2025

ejona86 added 2 commits March 6, 2025 13:48

s/status/errorStatus/

e48fc94

s/newBuilder().build()/getDefaultInstance()/

436ecbc

ejona86 requested a review from danielzhaotongliu March 6, 2025 21:50

danielzhaotongliu approved these changes Mar 6, 2025

ejona86 merged commit d82613a into grpc:master Mar 7, 2025
16 checks passed

ejona86 deleted the xds-cluster-race branch March 7, 2025 18:33

github-actions bot locked as resolved and limited conversation to collaborators Jun 6, 2025
