clusterresolver: handle EDS nacks and resource-not-found errors correctly #6436

easwars · 2023-07-06T23:32:46Z

EDS nacks and resource-not-found errors are to be treated by the cluster_resolver LB policy as though it received an EDS update with no endpoints. This would mean that for aggregate clusters, it would result in fallback (if appropriate), and non-aggregate clusters, it would result in the channel moving to TRANSIENT_FAILURE (because the child policy, priority LB, will put the channel in TRANSIENT_FAILURE when there are no priorities in its configuration).

The only difference between how a NACK error and a resource-not-found error is handled by the cluster_resolver LB policy is that in the NACK case, the LB policy ignores the error if there was a previous result. The resource-not-found error is received by the cluster_resolver LB policy only when it does not have a previously received good update. This is because, the xds client will not propagate this error to its watchers (because EDS is not state-of-the-world) if it already has the resource in its cache.

Fixes #6217

RELEASE NOTES:

clusterresolver: handle EDS nacks and resource-not-found errors correctly

…ctly

zasweq · 2023-07-07T22:28:43Z

xds/internal/balancer/clusterresolver/resource_resolver_eds.go

+	}
+
+	// If a previously received good configuration exists, continue to use it.
+	// Else report an empty update that would result in the child policy


"child priority of priority corresponding to EDS reporting Transient Failure due to the recipient of an empty update." also drop the no priorities or localities, there would still be a hierarchy of priorities, but one of them will report TF. It's actually technically weighted target that triggers the TF, so maybe say causing weighted_target balancer down the xDS tree for the EDS priority...

As mentioned in the other review comment, an empty update would result in that child not being created. I updated the comment anyways, making it reflect reality more than it was previously doing.

Oh ok I see, I just remember having to add TF back from weighted target if addr list is zero in UpdateClientConnState. That's prob why I was under the impression the system was two children with top priority child and second priority child, and first sends TF back. The first one just not being created makes sense too haha :).

xds/internal/balancer/clusterresolver/resource_resolver_eds.go

zasweq

Looks good overall just minor nits :). Spurred some interesting discussion with Eric.

xds/internal/balancer/clusterresolver/e2e_test/eds_impl_test.go

xds/internal/balancer/clusterresolver/e2e_test/aggregate_cluster_test.go

zasweq · 2023-07-08T01:04:05Z

xds/internal/balancer/clusterresolver/e2e_test/aggregate_cluster_test.go

+	xdsClient, close, err := xdsclient.NewWithConfigForTesting(&bootstrap.Config{
+		XDSServer: xdstestutils.ServerConfigForAddress(t, mgmtServer.Address),
+		NodeProto: &v3corepb.Node{Id: nodeID},
+	}, defaultTestWatchExpiryTimeout, time.Duration(0))


Optional, mainly musing. The test above polls for a second, and this requires 500 milliseconds to invoke the resource not found error. Is there a way to write this that doesn't introduce testing time as such? I don't see a way around it but curious your thoughts.

The test above polls for a second.

I don't understand what you are referring to here.

Is there a way to write this that doesn't introduce testing time as such?

For RDS/EDS, the only way to force a resource-not-found error is by the watch timer firing. And since this is an e2e test, there is no way for us to inject such an event other than actually waiting for it to happen. And the only thing we can do here is to pass a really low value for the watch timeout.

Also, the code towards the end of this test, where it verifies fallback does not actually depend on any timings. It simply verifies that an RPC gets routed to a backend in the DNS cluster before the test context expires.

I don't see a way around it but curious your thoughts.

I don't see a way either :)

Wrt timer...could do something like: https://github.com/grpc/grpc-go/blob/master/xds/internal/balancer/outlierdetection/balancer.go#L50 for time.AfterFunc, or https://github.com/grpc/grpc-go/pull/4270/files#diff-cfb871b425a217deb8602c43a41b39ed1378776d5a40bebee381fec0f231a011R51 for a newTimer. But yeah, prob not worth it in this scenario. I also was thinking setting the 500 ms to something like 10ms since I don't think it matters the length.

zasweq · 2023-07-08T01:07:32Z

xds/internal/balancer/clusterresolver/e2e_test/eds_impl_test.go

+// TestEDS_BadUpdateWithoutPreviousGoodUpdate tests the case where the
+// management server sends a bad update (one that is NACKed by the xDS client).
+// Since the cluster_resolver LB policy does not have a previously received good
+// update, it is expected to treat this bad update as though it received an
+// update with no endpoints. Hence RPCs are expected to fail with "all
+// priorities removed" error.


I'm having trouble following why we want this error to be "AllPrioritiesRemoved". The priority still exists right? I.e. in the failover case where you have two children top level priority reports TF causing failover. Thus, why is it when there's only one discovery mechanism it doesn't have the same behavior, but has specifically this error? Don't we want a TF from weighted target? Or does that cause this error? Or do we just not create a priority child in one discovery mechanism, but with two you create and send down empty update?

In this case, the EDS discovery mechanism sends an empty xdsresource.EndpointsUpdate to the cluster_resolver LB policy, and this results in a configuration that does not contain priorities. When such a configuration is sent to the priority LB policy, this is error that we expect.

Do you see any downsides to doing this? Or do you see any advantages of sending a configuration with one priority (but no endpoints) to the priority LB policy and letting weighted_target set the channel to TRANSIENT_FAILURE. I don't see any benefits of doing so, but would like to hear your opinion on it.

Or do we just not create a priority child in one discovery mechanism, but with two you create and send down empty update?

That is not true. Even in the aggregate case, if the EDS discovery mechanism returns an emoty xdsresource.Endpoints update, the generated config will not have a priority child for the EDS discovery mechanism. (Note that , it will continue to have the EDS child if there was a previously received good response).

Haha, right. As per my other reply, I was under the impression that the system was "Or do you see any advantages of sending a configuration with one priority (but no endpoints) to the priority LB policy and letting weighted_target set the channel to TRANSIENT_FAILURE". Thus, I made this comment as the error did not make sense to me. The explanation of how the system works sgtm and is cleaner, and the error makes sense. Curious though, is this scoped in a gRFC? Or just an implementation detail per language not explicitly defined. We should I think shoot for implementation consistency between other languages.

We can discuss this in the chat room to see what the other languages do and whether this is scoped out in any gRFC.

xds/internal/balancer/clusterresolver/e2e_test/eds_impl_test.go

zasweq

going to approve since overall this looked good to me, and I'm taking leave. Please get to the comments though :).

easwars

Thanks for the review.

xds/internal/balancer/clusterresolver/e2e_test/aggregate_cluster_test.go

easwars · 2023-07-11T02:11:49Z

xds/internal/balancer/clusterresolver/e2e_test/aggregate_cluster_test.go

+	xdsClient, close, err := xdsclient.NewWithConfigForTesting(&bootstrap.Config{
+		XDSServer: xdstestutils.ServerConfigForAddress(t, mgmtServer.Address),
+		NodeProto: &v3corepb.Node{Id: nodeID},
+	}, defaultTestWatchExpiryTimeout, time.Duration(0))


The test above polls for a second.

I don't understand what you are referring to here.

Is there a way to write this that doesn't introduce testing time as such?

For RDS/EDS, the only way to force a resource-not-found error is by the watch timer firing. And since this is an e2e test, there is no way for us to inject such an event other than actually waiting for it to happen. And the only thing we can do here is to pass a really low value for the watch timeout.

Also, the code towards the end of this test, where it verifies fallback does not actually depend on any timings. It simply verifies that an RPC gets routed to a backend in the DNS cluster before the test context expires.

I don't see a way around it but curious your thoughts.

I don't see a way either :)

xds/internal/balancer/clusterresolver/e2e_test/aggregate_cluster_test.go

easwars · 2023-07-11T02:21:03Z

xds/internal/balancer/clusterresolver/e2e_test/eds_impl_test.go

+// TestEDS_BadUpdateWithoutPreviousGoodUpdate tests the case where the
+// management server sends a bad update (one that is NACKed by the xDS client).
+// Since the cluster_resolver LB policy does not have a previously received good
+// update, it is expected to treat this bad update as though it received an
+// update with no endpoints. Hence RPCs are expected to fail with "all
+// priorities removed" error.


In this case, the EDS discovery mechanism sends an empty xdsresource.EndpointsUpdate to the cluster_resolver LB policy, and this results in a configuration that does not contain priorities. When such a configuration is sent to the priority LB policy, this is error that we expect.

Do you see any downsides to doing this? Or do you see any advantages of sending a configuration with one priority (but no endpoints) to the priority LB policy and letting weighted_target set the channel to TRANSIENT_FAILURE. I don't see any benefits of doing so, but would like to hear your opinion on it.

Or do we just not create a priority child in one discovery mechanism, but with two you create and send down empty update?

That is not true. Even in the aggregate case, if the EDS discovery mechanism returns an emoty xdsresource.Endpoints update, the generated config will not have a priority child for the EDS discovery mechanism. (Note that , it will continue to have the EDS child if there was a previously received good response).

xds/internal/balancer/clusterresolver/e2e_test/eds_impl_test.go

easwars · 2023-07-11T02:33:38Z

xds/internal/balancer/clusterresolver/resource_resolver_eds.go

+	}
+
+	// If a previously received good configuration exists, continue to use it.
+	// Else report an empty update that would result in the child policy


As mentioned in the other review comment, an empty update would result in that child not being created. I updated the comment anyways, making it reflect reality more than it was previously doing.

xds/internal/balancer/clusterresolver/resource_resolver_eds.go

easwars

Thanks for the review.

clusterresolver: handle EDS nacks and resource-not-found errors corre…

06851fe

…ctly

easwars requested a review from zasweq July 6, 2023 23:33

easwars assigned zasweq Jul 6, 2023

easwars added the Type: Bug label Jul 6, 2023

easwars added this to the 1.57 Release milestone Jul 6, 2023

zasweq reviewed Jul 7, 2023

View reviewed changes

xds/internal/balancer/clusterresolver/resource_resolver_eds.go Outdated Show resolved Hide resolved

zasweq reviewed Jul 8, 2023

View reviewed changes

xds/internal/balancer/clusterresolver/resource_resolver_eds.go Show resolved Hide resolved

zasweq requested changes Jul 8, 2023

View reviewed changes

zasweq assigned easwars and unassigned zasweq Jul 8, 2023

zasweq approved these changes Jul 10, 2023

View reviewed changes

easwars commented Jul 11, 2023

View reviewed changes

review comments

14e466f

easwars merged commit bf5b7ae into grpc:master Jul 11, 2023
11 checks passed

easwars deleted the clusterresolver_fix_6217 branch July 11, 2023 02:56

github-actions bot locked as resolved and limited conversation to collaborators Jan 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

clusterresolver: handle EDS nacks and resource-not-found errors correctly #6436

clusterresolver: handle EDS nacks and resource-not-found errors correctly #6436

easwars commented Jul 6, 2023 •

edited

zasweq Jul 7, 2023 •

edited

easwars Jul 11, 2023

zasweq Jul 12, 2023

zasweq left a comment

zasweq Jul 8, 2023

easwars Jul 11, 2023

zasweq Jul 12, 2023

zasweq Jul 8, 2023

easwars Jul 11, 2023

zasweq Jul 12, 2023

easwars Jul 12, 2023

zasweq left a comment

easwars left a comment

easwars Jul 11, 2023

easwars Jul 11, 2023

easwars Jul 11, 2023

easwars left a comment

clusterresolver: handle EDS nacks and resource-not-found errors correctly #6436

clusterresolver: handle EDS nacks and resource-not-found errors correctly #6436

Conversation

easwars commented Jul 6, 2023 • edited

zasweq Jul 7, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zasweq left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zasweq left a comment

Choose a reason for hiding this comment

easwars left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

easwars left a comment

Choose a reason for hiding this comment

easwars commented Jul 6, 2023 •

edited

zasweq Jul 7, 2023 •

edited