A71: xDS Fallback #386
Conversation
This looks like a good start!
Please let me know if you have any questions. Thanks!
A71-xds-fallback.md
Outdated
Current xDS implementation in gRPC uses a single global instance of
the XdsClient class to communicate with the xDS control plane. This instance
is shared across all channels to reduce overhead by providing a shared cache
for xDS resources as well as reducing a number of connections to xDS servers.
s/reducing a number/reducing the number/
Done
A71-xds-fallback.md
Outdated
due to spike in demand once the primary server goes down and all the clients
try to refetch the resources.

#### Always switch to a fallback server if the primary is not available but only if there's a need to fetch the resources that are missing from the cache.
How is this different from the chosen approach?
In this case there's only a single XdsClient so all cached resources for all origins would be purged if there's a request for uncached resource.
A71-xds-fallback.md
Outdated
gRPC code should use fallback iff both:

* primary server not reachable
* at least one watcher exists for a resource that is not cached (where
There are 2 cases where a watcher could exist for a non-cached resource:
- Didn't get a response before the primary server became unreachable
- Primary server was already unreachable and a new request for a resource is made

The code paths for these situations seem like they would be quite different.
I think there probably will be two code-paths that need to be changed here, but I think each one will basically just call the same code to provide the right behavior, so it's really just two ways of triggering the same common code. I don't think there will be much duplication here.
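The two triggering paths discussed above converge on one shared check. A minimal sketch of that predicate, under the stated assumptions; all names here (`XdsClientState`, `should_use_fallback`) are hypothetical, not actual gRPC internals:

```python
from dataclasses import dataclass, field

@dataclass
class XdsClientState:
    """Hypothetical view of the XdsClient state relevant to fallback."""
    primary_reachable: bool = True
    cached_resources: set = field(default_factory=set)
    watched_resources: set = field(default_factory=set)

def should_use_fallback(state: XdsClientState) -> bool:
    """Fallback is used iff the primary is unreachable AND at least one
    watcher exists for a resource that is not in the cache."""
    uncached_watches = state.watched_resources - state.cached_resources
    return (not state.primary_reachable) and bool(uncached_watches)
```

Whether the watch predates the outage or is created while the primary is already down, both code paths would evaluate this same condition, so the fallback decision itself is not duplicated.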
Please take another look.
I believe I addressed your comments. Please take another look.
The structure looks much better now!
Please let me know if you have any questions. Thanks!
A71-xds-fallback.md
Outdated
We have no guarantee that a combination of resources from different xDS servers
form a valid cohesive configuration, so we cannot make this determination on
a per-resource basis. We need any given gRPC channel to use the resources
the same server
s/use the resources the same/only use the resources from a single/
Done
A71-xds-fallback.md
Outdated
below).

Changes XdsClient to per-data plane target would enable gRPC to switch to
fallback configuration only for channels that need xDS resource that have not
s/resource/resources/
Done.
A71-xds-fallback.md
Outdated
The fallback process is initiated if both of the following conditions hold:

* There is a connectivity failure on the ADS stream, as described in [A57]:
Need to clarify whether it's "either" or "both" conditions.
"either" 😀
A71-xds-fallback.md
Outdated
plane becomes unavailable after a resource change notification was received.

XdsClients will need to be changed to support multiple ADS connections for each
authority. Once the fallback process begins, impacted XdsClient will establish
s/impacted/an impacted/
Done
A71-xds-fallback.md
Outdated
Connecting to the lower-priority servers does not close gRPC connections to the
higher-priority servers. XdsClient will still wait for xDS resources on the ADS
stream. Once such resource is received, XdsClient will close connections to the
s/resource is received, XdsClient/a resource is received, the XdsClient/
Done
Thanks for the comments. Updated!
A71-xds-fallback.md
Outdated
Current xDS implementation in gRPC uses a single global instance of
the XdsClient class to communicate with the xDS control plane. This instance
is shared across all channels to reduce overhead by providing a shared cache
Nit: This instance is shared across all channels and all xDS-enabled gRPC server listeners.
Done.
### Reservations about using the fallback server data

We are expecting to use this in an environment where the data provided by
Would it make sense to give an example of such an environment? And in what environments would the converse be true, i.e. data from primary and fallback management servers are equally favorable?
Added a clarification
A71-xds-fallback.md
Outdated
We have no guarantee that a combination of resources from different xDS servers
form a valid cohesive configuration, so we cannot make this determination on
a per-resource basis. We need any given gRPC channel to only use the resources
Same here. Not just any given gRPC channel, but any given xDS-enabled gRPC server listener.
Maybe we can introduce a new term (something like a gRPC endpoint) to encapsulate both xDS-enabled gRPC channels and xDS-enabled gRPC servers, and use that term wherever applicable.
Done.
A71-xds-fallback.md
Outdated
We have no guarantee that a combination of resources from different xDS servers
form a valid cohesive configuration, so we cannot make this determination on
a per-resource basis. We need any given gRPC channel to only use the resources
the single server
Nit: terminate with a period.
Thanks!
gRPC servers using the xDS configuration will share the same XdsClient instance
keyed with a dedicated well-known key value.
Java and Go support multiple listeners for the same xDS-enabled gRPC server. This means that the LDS resource requested by each of these listeners will be different. Instead of using a well-known key and sharing the XdsClient across all these listeners, would it make sense to key it based on the server listener resource name? This would mean that there will be no sharing of XdsClient across server listeners though.
And the same issue can show up in C-core as well, if the application contains more than one xDS-enabled gRPC server.
@markdroth I am not familiar with this feature. Any advice?
Actually, it's Java that does not support multiple listeners per server; C-core and Go both do.
It's generally fairly rare to have multiple servers or multiple listening addresses in the same binary, and in the cases where it happens, it's most likely the case that the two listeners are closely related and would be fine sharing fate w.r.t. fallback (e.g., serving the same content on a public port with TlsCreds and on a loopback port with InsecureCreds). And in C-core and Go, using a different XdsClient per listening address would add complexity, since we'd need to track different XdsClient instances in the same server.
For now, let's stick with a single key for all servers, regardless of listening address. We can change this later if needed.
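As a rough illustration of the keying scheme settled on here: channels get an XdsClient per data-plane target, while all xDS-enabled servers share one instance under a single well-known key. This is a hypothetical sketch, not gRPC code; the key value, class, and helper names are all invented:

```python
# Dedicated well-known key shared by all xDS-enabled servers in the process
# (the actual key string is implementation-defined; "#server" is made up).
_SERVER_KEY = "#server"
_clients: dict = {}

class XdsClient:
    def __init__(self, key: str):
        self.key = key

def get_client_for_channel(data_plane_target: str) -> XdsClient:
    # One XdsClient per data-plane target, so fallback on one target does
    # not affect channels talking to other targets.
    return _clients.setdefault(data_plane_target, XdsClient(data_plane_target))

def get_client_for_server() -> XdsClient:
    # All servers share one instance regardless of listening address,
    # so they share fate w.r.t. fallback.
    return _clients.setdefault(_SERVER_KEY, XdsClient(_SERVER_KEY))
```

This mirrors the decision above: per-listener keying would add bookkeeping inside a single server for little benefit, while a single server key keeps closely related listeners on one configuration source.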
fallback configuration only for channels that need xDS resources that have not
been downloaded yet.

This change would change the channels to get XdsClient for their data plane
The CSDS change seems independent enough to deserve a separate sub-section.
Updated
A71-xds-fallback.md
Outdated
XdsClients will need to be changed to support multiple ADS connections for each
authority. Once the fallback process begins, an impacted XdsClient will establish
a connection to the next xDS control plane listed in the bootstrap JSON. Then
XdsClient will subscribe to all watched resources on that server and will
update the cache based on the received responses.
Can we add more details here with example configurations?
It is not clear to me what happens in the following scenario
- a grpc channel has received its full configuration and is serving traffic
- a config update adds a new cluster in the route table
- at the same time the xDS channel moves to TRANSIENT_FAILURE
- Do we switch to the fallback server in the background and wait to receive the full configuration before using it, or do we switch immediately and fail RPCs until we receive the full configuration
The same question applies when a grpc channel has received full configuration from a secondary xds server, and the primary becomes available. When do we switch to it?
a config update adds a new cluster in the route table
at the same time the xDS channel moves to TRANSIENT_FAILURE
Depends on whether the channel is trying to obtain the updated resource. E.g. there's an inherent race condition whether or not the channel "knows" there was a config change.
The same question applies when a grpc channel has received full configuration from a secondary xds server, and the primary becomes available. When do we switch to it?
Switch happens as soon as a connection to the primary server is reestablished.
It is not clear to me what happens in the following scenario
- a grpc channel has received its full configuration and is serving traffic
- a config update adds a new cluster in the route table
- at the same time the xDS channel moves to TRANSIENT_FAILURE
(This is basically the mid-update case referenced in the gRFC, but as per my comment elsewhere, we need to clarify the description.)
- Do we switch to the fallback server in the background and wait to receive the full configuration before using it, or do we switch immediately and fail RPCs until we receive the full configuration
Neither. What will happen is that the xds resolver will immediately get the new RDS resource, so it will add the CDS LB policy for the new cluster, and it will return a ConfigSelector that uses the new RouteConfig, so RPCs will immediately start being routed to that cluster. However, the new CDS LB policy will not immediately get any notification on its CDS watch, so it will stay in CONNECTING state until the XdsClient switches over to the fallback server and obtains the CDS resource from there. The result is that we won't fail any RPCs, but we will cause some latency.
The ideal behavior is that the channel would not switch over to the new RouteConfig until it obtains the necessary CDS resource from the fallback server, so that we don't cause that unnecessary latency. However, we can't do that until we implement gRFC A74 (#404), which we decided to split out from this gRFC, since it's really a somewhat separate issue.
The same question applies when a grpc channel has received full configuration from a secondary xds server, and the primary becomes available. When do we switch to it?
As soon as we get the first resource from the primary server, we immediately close the connection to the fallback server, and we update the watchers from the responses we get from the primary server.
The idea here is that the fallback logic is entirely within the XdsClient and is not visible to the XdsClient's watchers in the gRPC channel. The logic in the gRPC channel will just continue to make decisions the way it always has.
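A toy sketch of that recovery behavior, with the fallback state kept entirely inside the client so watchers only ever observe cache updates. All names are hypothetical, and the real XdsClient is far more involved:

```python
class FallbackAwareXdsClient:
    """Hypothetical model: fallback is invisible to watchers; they only
    see cache updates, whichever server those updates came from."""

    def __init__(self):
        self.active_server = "primary"
        self.fallback_connected = False
        self.cache = {}

    def start_fallback(self):
        # Triggered when the primary is unreachable and an uncached
        # resource is being watched. The primary connection stays open.
        self.fallback_connected = True
        self.active_server = "fallback"

    def on_resource(self, server: str, name: str, resource: str):
        if server == "primary" and self.fallback_connected:
            # First resource received from the primary: immediately drop
            # the fallback connection and resume using primary responses.
            self.fallback_connected = False
            self.active_server = "primary"
        if server == self.active_server:
            self.cache[name] = resource  # watchers are notified from here
```

The point of the sketch is the last comment above: the gRPC channel's logic is unchanged; only the XdsClient knows which server is currently feeding the cache.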
Integrated the comments.
A71-xds-fallback.md
Outdated
Current gRPC xDS implementations only use the first xDS configuration server
listed in the bootstrap JSON document. Fallback implementation requires changes
to use all servers listed in order from highest to lowest priority.
Would be good to specify that the order establishes the priority with the first being the highest priority.
Done
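For illustration, reading the bootstrap server list in priority order might look like the sketch below. The `xds_servers` field follows the gRPC xDS bootstrap format; the helper name and server URIs are made up:

```python
import json

# Example bootstrap fragment: list order establishes priority,
# with the first entry being the highest-priority server.
bootstrap = json.loads("""
{
  "xds_servers": [
    {"server_uri": "primary.example.com:443"},
    {"server_uri": "fallback.example.com:443"}
  ]
}
""")

def servers_by_priority(cfg: dict) -> list:
    """Return server URIs from highest to lowest priority."""
    return [s["server_uri"] for s in cfg["xds_servers"]]
```

Fallback then simply walks this list forward when the current server is unusable, and walks back to index 0 as soon as the primary delivers a resource again.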
This is looking really good overall! Just a few remaining comments to address.
Please let me know if you have any questions. Thanks!
Thank you for the reviews.
Looks great! Just one minor comment remaining.
Note: This PR was incorrectly merged before the gRFC had been approved. We'll review offline and make any subsequent changes in #407.
I have a comment about clarity, and we're already discussing the desire to define what string servers use for the key. With those, this LGTM.
gRPC servers using the xDS configuration will share the same XdsClient instance
keyed with a dedicated well-known key value.

Changes XdsClient to per-data plane target would enable gRPC to switch to
This looks to be missing a subject. Was this intended to be "Changing XdsClient..."?
That's unclear though. Is that a theoretical, "if we made this change then we'd get this result," or is it "we are making this change to get this result"? Maybe something like:
"gRPC Channel will share XdsClient instances keyed by the data plane target, which would enable..."
And then maybe the "This change would change..." paragraph gets merged in here. Right now it is unclear what "this change" is referring to.
Please take a look at #407 where I tried addressing your comment.