A71: xDS Fallback #386
Conversation
This looks like a good start!
Please let me know if you have any questions. Thanks!
A71-xds-fallback.md
Outdated
Current xDS implementation in gRPC uses a single global instance of
the XdsClient class to communicate with the xDS control plane. This instance
is shared across all channels to reduce overhead by providing a shared cache
for xDS resources as well as reducing a number of connections to xDS servers.
s/reducing a number/reducing the number/
Done
A71-xds-fallback.md
Outdated
due to spike in demand once the primary server goes down and all the clients
try to refetch the resources.

#### Always switch to a fallback server if the primary is not available but only if there's a need to fetch the resources that are missing from the cache.
How is this different from the chosen approach?
In this case there's only a single XdsClient so all cached resources for all origins would be purged if there's a request for uncached resource.
A71-xds-fallback.md
Outdated
gRPC code should use fallback iff both:

* primary server not reachable
* at least one watcher exists for a resource that is not cached (where
There are 2 cases where a watcher could exist for a non-cached resource:
- Didn't get a response before the primary server became unreachable
- Primary server was already unreachable and a new request for a resource is made

The code paths for these situations seem like they would be quite different.
I think there probably will be two code-paths that need to be changed here, but I think each one will basically just call the same code to provide the right behavior, so it's really just two ways of triggering the same common code. I don't think there will be much duplication here.
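The two triggering paths discussed above converge on one shared check. A minimal sketch of that predicate, under the stated assumptions; all names here (`XdsClientState`, `should_use_fallback`) are hypothetical, not actual gRPC internals:

```python
from dataclasses import dataclass, field

@dataclass
class XdsClientState:
    """Hypothetical view of the XdsClient state relevant to fallback."""
    primary_reachable: bool = True
    cached_resources: set = field(default_factory=set)
    watched_resources: set = field(default_factory=set)

def should_use_fallback(state: XdsClientState) -> bool:
    """Fallback is used iff the primary is unreachable AND at least one
    watcher exists for a resource that is not in the cache."""
    uncached_watches = state.watched_resources - state.cached_resources
    return (not state.primary_reachable) and bool(uncached_watches)
```

Whether the watch predates the outage or is created while the primary is already down, both code paths would evaluate this same condition, so the fallback decision itself is not duplicated.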
Please take another look.
I believe I addressed your comments. Please take another look.
The structure looks much better now!
Please let me know if you have any questions. Thanks!
A71-xds-fallback.md
Outdated
We have no guarantee that a combination of resources from different xDS servers
form a valid cohesive configuration, so we cannot make this determination on
a per-resource basis. We need any given gRPC channel to use the resources
the same server
s/use the resources the same/only use the resources from a single/
Done
A71-xds-fallback.md
Outdated
below).

Changes XdsClient to per-data plane target would enable gRPC to switch to
fallback configuration only for channels that need xDS resource that have not
s/resource/resources/
Done.
A71-xds-fallback.md
Outdated
The fallback process is initiated if both of the following conditions hold:

* There is a connectivity failure on the ADS stream, as described in [A57]:
Need to clarify whether it's "either" or "both" conditions.
"either" 😀
A71-xds-fallback.md
Outdated
plane becomes unavailable after a resource change notification was received.

XdsClients will need to be changed to support multiple ADS connections for each
authority. Once the fallback process begins, impacted XdsClient will establish
s/impacted/an impacted/
Done
A71-xds-fallback.md
Outdated
Connecting to the lower-priority servers does not close gRPC connections to the
higher-priority servers. XdsClient will still wait for xDS resources on the ADS
stream. Once such resource is received, XdsClient will close connections to the
s/resource is received, XdsClient/a resource is received, the XdsClient/
Done
Thanks for the comments. Updated!
A71-xds-fallback.md
Outdated
Current xDS implementation in gRPC uses a single global instance of
the XdsClient class to communicate with the xDS control plane. This instance
is shared across all channels to reduce overhead by providing a shared cache
Nit: This instance is shared across all channels and all xDS-enabled gRPC server listeners.
Done.
### Reservations about using the fallback server data

We are expecting to use this in an environment where the data provided by
Would it make sense to give an example of such an environment? And in what environments would the converse be true, i.e. data from primary and fallback management servers are equally favorable?
Added a clarification
A71-xds-fallback.md
Outdated
We have no guarantee that a combination of resources from different xDS servers
form a valid cohesive configuration, so we cannot make this determination on
a per-resource basis. We need any given gRPC channel to only use the resources
Same here. Not just any given gRPC channel, but any given xDS-enabled gRPC server listener.
Maybe we can introduce a new term (something like a gRPC endpoint) to encapsulate both xDS-enabled gRPC channels and xDS-enabled gRPC servers, and use that term wherever applicable.
Done.
A71-xds-fallback.md
Outdated
We have no guarantee that a combination of resources from different xDS servers
form a valid cohesive configuration, so we cannot make this determination on
a per-resource basis. We need any given gRPC channel to only use the resources
the single server
Nit: terminate with a period.
Thanks!
gRPC servers using the xDS configuration will share the same XdsClient instance
keyed with a dedicated well-known key value.
Java and Go support multiple listeners for the same xDS-enabled gRPC server. This means that the LDS resource requested by each of these listeners will be different. Instead of using a well-known key and sharing the XdsClient across all these listeners, would it make sense to key it based on the server listener resource name? This would mean that there will be no sharing of XdsClient across server listeners though.
And the same issue can show up in C-core as well, if the application contains more than one xDS-enabled gRPC server.
@markdroth I am not familiar with this feature. Any advice?
Actually, it's Java that does not support multiple listeners per server; C-core and Go both do.
It's generally fairly rare to have multiple servers or multiple listening addresses in the same binary, and in the cases where it happens, it's most likely the case that the two listeners are closely related and would be fine sharing fate w.r.t. fallback (e.g., serving the same content on a public port with TlsCreds and on a loopback port with InsecureCreds). And in C-core and Go, using a different XdsClient per listening address would add complexity, since we'd need to track different XdsClient instances in the same server.
For now, let's stick with a single key for all servers, regardless of listening address. We can change this later if needed.
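As a rough illustration of the keying scheme settled on here: channels get an XdsClient per data-plane target, while all xDS-enabled servers share one instance under a single well-known key. This is a hypothetical sketch, not gRPC code; the key value, class, and helper names are all invented:

```python
# Dedicated well-known key shared by all xDS-enabled servers in the process
# (the actual key string is implementation-defined; "#server" is made up).
_SERVER_KEY = "#server"
_clients: dict = {}

class XdsClient:
    def __init__(self, key: str):
        self.key = key

def get_client_for_channel(data_plane_target: str) -> XdsClient:
    # One XdsClient per data-plane target, so fallback on one target does
    # not affect channels talking to other targets.
    return _clients.setdefault(data_plane_target, XdsClient(data_plane_target))

def get_client_for_server() -> XdsClient:
    # All servers share one instance regardless of listening address,
    # so they share fate w.r.t. fallback.
    return _clients.setdefault(_SERVER_KEY, XdsClient(_SERVER_KEY))
```

This mirrors the decision above: per-listener keying would add bookkeeping inside a single server for little benefit, while a single server key keeps closely related listeners on one configuration source.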
fallback configuration only for channels that need xDS resources that have not
been downloaded yet.

This change would change the channels to get XdsClient for their data plane
The CSDS change seems independent enough to deserve a separate sub-section.
Updated
A71-xds-fallback.md
Outdated
XdsClients will need to be changed to support multiple ADS connections for each
authority. Once the fallback process begins, an impacted XdsClient will establish
a connection to the next xDS control plane listed in the bootstrap JSON. Then
XdsClient will subscribe to all watched resources on that server and will
update the cache based on the received responses.
Can we add more details here with example configurations?
It is not clear to me what happens in the following scenario
- a grpc channel has received its full configuration and is serving traffic
- a config update adds a new cluster in the route table
- at the same time the xDS channel moves to TRANSIENT_FAILURE
- Do we switch to the fallback server in the background and wait to receive the full configuration before using it, or do we switch immediately and fail RPCs until we receive the full configuration
The same question applies when a grpc channel has received full configuration from a secondary xds server, and the primary becomes available. When do we switch to it?
a config update adds a new cluster in the route table
at the same time the xDS channel moves to TRANSIENT_FAILURE
Depends on whether the channel is trying to obtain the updated resource. E.g. there's an inherent race condition whether or not the channel "knows" there was a config change.
The same question applies when a grpc channel has received full configuration from a secondary xds server, and the primary becomes available. When do we switch to it?
Switch happens as soon as a connection to the primary server is reestablished.
It is not clear to me what happens in the following scenario
- a grpc channel has received its full configuration and is serving traffic
- a config update adds a new cluster in the route table
- at the same time the xDS channel moves to TRANSIENT_FAILURE
(This is basically the mid-update case referenced in the gRFC, but as per my comment elsewhere, we need to clarify the description.)
- Do we switch to the fallback server in the background and wait to receive the full configuration before using it, or do we switch immediately and fail RPCs until we receive the full configuration
Neither. What will happen is that the xds resolver will immediately get the new RDS resource, so it will add the CDS LB policy for the new cluster, and it will return a ConfigSelector that uses the new RouteConfig, so RPCs will immediately start being routed to that cluster. However, the new CDS LB policy will not immediately get any notification on its CDS watch, so it will stay in CONNECTING state until the XdsClient switches over to the fallback server and obtains the CDS resource from there. The result is that we won't fail any RPCs, but we will cause some latency.
The ideal behavior is that the channel would not switch over to the new RouteConfig until it obtains the necessary CDS resource from the fallback server, so that we don't cause that unnecessary latency. However, we can't do that until we implement gRFC A74 (#404), which we decided to split out from this gRFC, since it's really a somewhat separate issue.
The same question applies when a grpc channel has received full configuration from a secondary xds server, and the primary becomes available. When do we switch to it?
As soon as we get the first resource from the primary server, we immediately close the connection to the fallback server, and we update the watchers from the responses we get from the primary server.
The idea here is that the fallback logic is entirely within the XdsClient and is not visible to the XdsClient's watchers in the gRPC channel. The logic in the gRPC channel will just continue to make decisions the way it always has.
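A toy sketch of that recovery behavior, with the fallback state kept entirely inside the client so watchers only ever observe cache updates. All names are hypothetical, and the real XdsClient is far more involved:

```python
class FallbackAwareXdsClient:
    """Hypothetical model: fallback is invisible to watchers; they only
    see cache updates, whichever server those updates came from."""

    def __init__(self):
        self.active_server = "primary"
        self.fallback_connected = False
        self.cache = {}

    def start_fallback(self):
        # Triggered when the primary is unreachable and an uncached
        # resource is being watched. The primary connection stays open.
        self.fallback_connected = True
        self.active_server = "fallback"

    def on_resource(self, server: str, name: str, resource: str):
        if server == "primary" and self.fallback_connected:
            # First resource received from the primary: immediately drop
            # the fallback connection and resume using primary responses.
            self.fallback_connected = False
            self.active_server = "primary"
        if server == self.active_server:
            self.cache[name] = resource  # watchers are notified from here
```

The point of the sketch is the last comment above: the gRPC channel's logic is unchanged; only the XdsClient knows which server is currently feeding the cache.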
Integrated the comments.
A71-xds-fallback.md
Outdated
Current gRPC xDS implementations only use the first xDS configuration server
listed in the bootstrap JSON document. Fallback implementation requires changes
to use all servers listed in order from highest to lowest priority.
Would be good to specify that the order establishes the priority with the first being the highest priority.
Done
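For illustration, reading the bootstrap server list in priority order might look like the sketch below. The `xds_servers` field follows the gRPC xDS bootstrap format; the helper name and server URIs are made up:

```python
import json

# Example bootstrap fragment: list order establishes priority,
# with the first entry being the highest-priority server.
bootstrap = json.loads("""
{
  "xds_servers": [
    {"server_uri": "primary.example.com:443"},
    {"server_uri": "fallback.example.com:443"}
  ]
}
""")

def servers_by_priority(cfg: dict) -> list:
    """Return server URIs from highest to lowest priority."""
    return [s["server_uri"] for s in cfg["xds_servers"]]
```

Fallback then simply walks this list forward when the current server is unusable, and walks back to index 0 as soon as the primary delivers a resource again.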
This is looking really good overall! Just a few remaining comments to address.
Please let me know if you have any questions. Thanks!
Thank you for the reviews.
Looks great! Just one minor comment remaining.
Note: This PR was incorrectly merged before the gRFC had been approved. We'll review offline and make any subsequent changes in #407.
I have a comment about clarity, and we're already discussing the desire to define what string servers use for the key. With those, this LGTM.
gRPC servers using the xDS configuration will share the same XdsClient instance
keyed with a dedicated well-known key value.

Changes XdsClient to per-data plane target would enable gRPC to switch to
This looks to be missing a subject. Was this intended to be "Changing XdsClient..."?
That's unclear though. Is that a theoretical, "if we made this change then we'd get this result," or is it "we are making this change to get this result"? Maybe something like:
"gRPC Channel will share XdsClient instances keyed by the data plane target, which would enable..."
And then maybe the "This change would change..." paragraph gets merged in here. Right now it is unclear what "this change" is referring to.
Please take a look at #407 where I tried addressing your comment.