Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A71: xDS Fallback #386

Merged
merged 24 commits into from Jan 4, 2024
Merged

A71: xDS Fallback #386

merged 24 commits into from Jan 4, 2024

Conversation

eugeneo
Copy link
Contributor

@eugeneo eugeneo commented Aug 15, 2023

No description provided.

@eugeneo eugeneo requested a review from markdroth August 18, 2023 17:18
@eugeneo eugeneo changed the title xDS fallback proposal A71: xDS Fallback Aug 18, 2023
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
@eugeneo eugeneo marked this pull request as ready for review August 18, 2023 17:37
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good start!

Please let me know if you have any questions. Thanks!

A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
@markdroth
Copy link
Member

CC @ginayeh @larry-safran

Current xDS implementation in gRPC uses a single global instance of
the XdsClient class to communicate with the xDS control plane. This instance
is shared across all channels to reduce overhead by providing a shared cache
for xDS resources as well as reducing a number of connections to xDS servers.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/reducing a number/reducing the number/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

due to spike in demand once the primary server goes down and all the clients
try to refetch the resources.

#### Always switch to a fallback server if the primary is not available but only if there's a need to fetch the resources that are missing from the cache.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this different from the chosen approach?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case there's only a single XdsClient so all cached resources for all origins would be purged if there's a request for uncached resource.

gRPC code should use fallback iff both:

* primary server not reachable
* at least one watcher exists for a resource that is not cached (where

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 2 cases where a watcher could exist for a non-cached resource

  1. Didn't get a response before primary server became unreachable
  2. Primary server was already unreachable and a new request for a resource is made

The code paths for these situations seems like they would be quite different

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there probably will be two code-paths that need to be changed here, but I think each one will basically just call the same code to provide the right behavior, so it's really just two ways of triggering the same common code. I don't think there will be much duplication here.

A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
gRPC code should use fallback iff both:

* primary server not reachable
* at least one watcher exists for a resource that is not cached (where
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there probably will be two code-paths that need to be changed here, but I think each one will basically just call the same code to provide the right behavior, so it's really just two ways of triggering the same common code. I don't think there will be much duplication here.

A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
Copy link
Contributor Author

@eugeneo eugeneo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take another look.

A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
due to spike in demand once the primary server goes down and all the clients
try to refetch the resources.

#### Always switch to a fallback server if the primary is not available but only if there's a need to fetch the resources that are missing from the cache.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case there's only a single XdsClient so all cached resources for all origins would be purged if there's a request for uncached resource.

A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
Copy link
Contributor Author

@eugeneo eugeneo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I addressed your comments. Please take another look.

A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The structure looks much better now!

Please let me know if you have any questions. Thanks!

A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
We have no guarantee that a combination of resources from different xDS servers
form a valid cohesive configuration, so we cannot make this determination on
a per-resource basis. We need any given gRPC channel to use the resources
the same server

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/use the resources the same/only use the resources from a single/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

A71-xds-fallback.md Outdated Show resolved Hide resolved
below).

Changes XdsClient to per-data plane target would enable gRPC to switch to
fallback configuration only for channels that need xDS resource that have not

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/resource/resources/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


The fallback process is initiated if both of the following conditions hold:

* There is a connectivity failure on the ADS stream, as described in [A57]:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to clarify between it being either or both conditions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"either" 😀

plane becomes unavailable after a resource change notification was received.

XdsClients will need to be changed to support multiple ADS connections for each
authority. Once the fallback process begins, impacted XdsClient will establish

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/impacted/an impacted/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

A71-xds-fallback.md Outdated Show resolved Hide resolved

Connecting to the lower-priority servers does not close gRPC connections to the
higher-priority servers. XdsClient will still wait for xDS resources on the ADS
stream. Once such resource is received, XdsClient will close connections to the

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/resource is received, XdsClient/a resource is received, the XdsClient/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor Author

@eugeneo eugeneo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments. Updated!

We have no guarantee that a combination of resources from different xDS servers
form a valid cohesive configuration, so we cannot make this determination on
a per-resource basis. We need any given gRPC channel to use the resources
the same server
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

A71-xds-fallback.md Outdated Show resolved Hide resolved
below).

Changes XdsClient to per-data plane target would enable gRPC to switch to
fallback configuration only for channels that need xDS resource that have not
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
easwars
easwars previously approved these changes Dec 18, 2023

Current xDS implementation in gRPC uses a single global instance of
the XdsClient class to communicate with the xDS control plane. This instance
is shared across all channels to reduce overhead by providing a shared cache
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This instance is shared across all channels and all xDS-enabled gRPC server listeners.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


### Reservations about using the fallback server data

We are expecting to use this in an environment where the data provided by
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to give an example of such an environment? And in what environments would be the converse be true, i.e data from primary and fallback management servers are equally favorable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a clarification


We have no guarantee that a combination of resources from different xDS servers
form a valid cohesive configuration, so we cannot make this determination on
a per-resource basis. We need any given gRPC channel to only use the resources
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. Not just any given gRPC channel, but any given xDS-enabled gRPC server listener.

Maybe we can introduce a new term (something like a gRPC endpoint) to encapsulate both xds-enabled gRPC channel and xds-enabled grpc servers, and use that term wherever applicable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

We have no guarantee that a combination of resources from different xDS servers
form a valid cohesive configuration, so we cannot make this determination on
a per-resource basis. We need any given gRPC channel to only use the resources
the single server
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: terminate with a period.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

A71-xds-fallback.md Show resolved Hide resolved
Comment on lines +67 to +68
gRPC servers using the xDS configuration will share the same XdsClient instance
keyed with a dedicated well-known key value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java and Go support multiple listeners for the same xds-enabled gRPC server. This means that the LDS resource requested by each of these listeners will be different. Instead of using a well-known key and sharing the XdsClient across all these listeners, would it make sense to key it based on the server listener resource name? This would mean that there will be no sharing of XdsClient across server listeners though.

And the same issue can show up in C-core as well, if the application contains more than one xds-enabled grpc server.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markdroth I am not familiar with this feature. Any advice?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it's Java that does not support multiple listeners per server; C-core and Go both do.

It's generally fairly rare to have multiple servers or multiple listening addresses in the same binary, and in the cases where it happens, it's most likely the case that the two listeners are closely related and would be fine sharing fate w.r.t. fallback (e.g., serving the same content on a public port with TlsCreds and on a loopback port with InsecureCreds). And in C-core and Go, using a different XdsClient per listening address would add complexity, since we'd need to track different XdsClient instances in the same server.

For now, let's stick with a single key for all servers, regardless of listening address. We can change this later if needed.

fallback configuration only for channels that need xDS resources that have not
been downloaded yet.

This change would change the channels to get XdsClient for their data plane
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CSDS change seems independent enough to deserve a separate sub-section.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Comment on lines 109 to 113
XdsClients will need to be changed to support multiple ADS connections for each
authority. Once the fallback process begins, an impacted XdsClient will establish
a connection to the next xDS control plane listed in the bootstrap JSON. Then
XdsClient will subscribe to all watched resources on that server and will
update the cache based on the received responses.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add more details here with example configurations?

It is not clear to me what happens in the following scenario

  • a grpc channel has received its full configuration and is serving traffic
  • a config update adds a new cluster in the route table
  • at the same time the xDS channel moves to TRANSIENT_FAILURE
  • Do we switch to the fallback server in the background and wait to receive the full configuration before using it, or do we switch immediately and fail RPCs until we receive the full configuration

The same question applies when a grpc channel has received full configuration from a secondary xds server, and the primary becomes available. When do we switch to it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a config update adds a new cluster in the route table
at the same time the xDS channel moves to TRANSIENT_FAILURE

Depends on whether the channel is trying to obtain updated resource. E.g. there's an inherent race condition whether or not the channel "knows" there was a config change.

The same question applies when a grpc channel has received full configuration from a secondary xds server, and > the primary becomes available. When do we switch to it?

Switch happens as soon as a connection to primary server is reestablished.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear to me what happens in the following scenario

  • a grpc channel has received its full configuration and is serving traffic
  • a config update adds a new cluster in the route table
  • at the same time the xDS channel moves to TRANSIENT_FAILURE

(This is basically the mid-update case referenced in the gRFC, but as per my comment elsewhere, we need to clarify the description.)

  • Do we switch to the fallback server in the background and wait to receive the full configuration before using it, or do we switch immediately and fail RPCs until we receive the full configuration

Neither. What will happen is that the xds resolver will immediately get the new RDS resource, so it will add the CDS LB policy for the new cluster, and it will return a ConfigSelector that uses the new RouteConfig, so RPCs will immediately start being routed to that cluster. However, the new CDS LB policy will not immediately get any notification on its CDS watch, so it will stay in CONNECTING state until the XdsClient switches over to the fallback server and obtains the CDS resource from there. The result is that we won't fail any RPCs, but we will cause some latency.

The ideal behavior is that the channel would not switch over to the new RouteConfig until it obtains the necessary CDS resource from the fallback server, so that we don't cause that unnecessary latency. However, we can't do that until we implement gRFC A74 (#404), which we decided to split out from this gRFC, since it's really a somewhat separate issue.

The same question applies when a grpc channel has received full configuration from a secondary xds server, and the primary becomes available. When do we switch to it?

As soon as we get the first resource from the primary server, we immediately close the connection to the fallback server, and we update the watchers from the responses we get from the primary server.

The idea here is that the fallback logic is entirely within the XdsClient and is not visible to the XdsClient's watchers in the gRPC channel. The logic in the gRPC channel will just continue to make decisions the way it always has.

Copy link
Contributor Author

@eugeneo eugeneo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Integrated the comments.


Current xDS implementation in gRPC uses a single global instance of
the XdsClient class to communicate with the xDS control plane. This instance
is shared across all channels to reduce overhead by providing a shared cache
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


### Reservations about using the fallback server data

We are expecting to use this in an environment where the data provided by
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a clarification


We have no guarantee that a combination of resources from different xDS servers
form a valid cohesive configuration, so we cannot make this determination on
a per-resource basis. We need any given gRPC channel to only use the resources
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

We have no guarantee that a combination of resources from different xDS servers
form a valid cohesive configuration, so we cannot make this determination on
a per-resource basis. We need any given gRPC channel to only use the resources
the single server
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment on lines +67 to +68
gRPC servers using the xDS configuration will share the same XdsClient instance
keyed with a dedicated well-known key value.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markdroth I am not familiar with this feature. Any advice?

fallback configuration only for channels that need xDS resources that have not
been downloaded yet.

This change would change the channels to get XdsClient for their data plane
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Comment on lines 109 to 113
XdsClients will need to be changed to support multiple ADS connections for each
authority. Once the fallback process begins, an impacted XdsClient will establish
a connection to the next xDS control plane listed in the bootstrap JSON. Then
XdsClient will subscribe to all watched resources on that server and will
update the cache based on the received responses.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a config update adds a new cluster in the route table
at the same time the xDS channel moves to TRANSIENT_FAILURE

Depends on whether the channel is trying to obtain updated resource. E.g. there's an inherent race condition whether or not the channel "knows" there was a config change.

The same question applies when a grpc channel has received full configuration from a secondary xds server, and > the primary becomes available. When do we switch to it?

Switch happens as soon as a connection to primary server is reestablished.

A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved

Current gRPC xDS implementations only use the first xDS configuration server
listed in the bootstrap JSON document. Fallback implementation requires changes
to use all servers listed in order from highest to lowest priority.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to specify that the order establishes the priority with the first being the highest priority.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really good overall! Just a few remaining comments to address.

Please let me know if you have any questions. Thanks!

Comment on lines +67 to +68
gRPC servers using the xDS configuration will share the same XdsClient instance
keyed with a dedicated well-known key value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it's Java that does not support multiple listeners per server; C-core and Go both do.

It's generally fairly rare to have multiple servers or multiple listening addresses in the same binary, and in the cases where it happens, it's most likely the case that the two listeners are closely related and would be fine sharing fate w.r.t. fallback (e.g., serving the same content on a public port with TlsCreds and on a loopback port with InsecureCreds). And in C-core and Go, using a different XdsClient per listening address would add complexity, since we'd need to track different XdsClient instances in the same server.

For now, let's stick with a single key for all servers, regardless of listening address. We can change this later if needed.

A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
Comment on lines 109 to 113
XdsClients will need to be changed to support multiple ADS connections for each
authority. Once the fallback process begins, an impacted XdsClient will establish
a connection to the next xDS control plane listed in the bootstrap JSON. Then
XdsClient will subscribe to all watched resources on that server and will
update the cache based on the received responses.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear to me what happens in the following scenario

  • a grpc channel has received its full configuration and is serving traffic
  • a config update adds a new cluster in the route table
  • at the same time the xDS channel moves to TRANSIENT_FAILURE

(This is basically the mid-update case referenced in the gRFC, but as per my comment elsewhere, we need to clarify the description.)

  • Do we switch to the fallback server in the background and wait to receive the full configuration before using it, or do we switch immediately and fail RPCs until we receive the full configuration

Neither. What will happen is that the xds resolver will immediately get the new RDS resource, so it will add the CDS LB policy for the new cluster, and it will return a ConfigSelector that uses the new RouteConfig, so RPCs will immediately start being routed to that cluster. However, the new CDS LB policy will not immediately get any notification on its CDS watch, so it will stay in CONNECTING state until the XdsClient switches over to the fallback server and obtains the CDS resource from there. The result is that we won't fail any RPCs, but we will cause some latency.

The ideal behavior is that the channel would not switch over to the new RouteConfig until it obtains the necessary CDS resource from the fallback server, so that we don't cause that unnecessary latency. However, we can't do that until we implement gRFC A74 (#404), which we decided to split out from this gRFC, since it's really a somewhat separate issue.

The same question applies when a grpc channel has received full configuration from a secondary xds server, and the primary becomes available. When do we switch to it?

As soon as we get the first resource from the primary server, we immediately close the connection to the fallback server, and we update the watchers from the responses we get from the primary server.

The idea here is that the fallback logic is entirely within the XdsClient and is not visible to the XdsClient's watchers in the gRPC channel. The logic in the gRPC channel will just continue to make decisions the way it always has.

Copy link
Contributor Author

@eugeneo eugeneo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the reviews.

A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved

Current gRPC xDS implementations only use the first xDS configuration server
listed in the bootstrap JSON document. Fallback implementation requires changes
to use all servers listed in order from highest to lowest priority.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
A71-xds-fallback.md Outdated Show resolved Hide resolved
Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Just one minor comment remaining.

A71-xds-fallback.md Outdated Show resolved Hide resolved
@eugeneo eugeneo merged commit eb0d8fc into master Jan 4, 2024
1 check passed
@eugeneo eugeneo deleted the fallback-proposal branch January 4, 2024 23:50
@markdroth
Copy link
Member

Note: This PR was incorrectly merged before the gRFC had been approved. We'll review offline and make any subsequent changes in #407.

Copy link
Member

@ejona86 ejona86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a comment about clarity, and we're already discussing the desire to define what string servers use for the key. With those, this LGTM.

gRPC servers using the xDS configuration will share the same XdsClient instance
keyed with a dedicated well-known key value.

Changes XdsClient to per-data plane target would enable gRPC to switch to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks to be missing a subject. Was this intended to be "Changing XdsClient..."

That's unclear though. Is that a theoretical, "if we made this change then we'd get this result," or is it "we are making this change to get this result." Maybe something like:

gRPC Channel will share XdsClient instances keyed by the data plane target , which would enable..."

And then maybe the "This change would change..." paragraph gets merged in here. Right now it is unclear what "this change" is referring to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take a loog at #407 where I tried addressing your comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants